Introduction

A brief history of Enron

Enron was a natural-gas transmission company founded in 1985 in the US. In the 1990s, the US Congress adopted a series of laws to deregulate the sale of natural gas, and Enron lost its exclusive right to operate its natural-gas pipelines. At that time, Jeffrey Skilling, who was initially a consultant and later became the company’s chief operating officer, transformed Enron into a trader of energy derivatives, acting as an intermediary between natural-gas producers and their customers. Enron soon became a leader in this market and made huge profits on its trades. During this golden age the company recruited Andrew Fastow, who quickly became chief financial officer, and diversified its activities to include electricity, coal, paper, and steel. However, success has its limits, and in the late 1990s the company’s profits started to shrink. Under pressure from shareholders, company executives began to rely on dubious accounting practices. In particular, they used “mark-to-market accounting”, which allowed the company to write unrealized future gains from some trading contracts into current income statements, thus giving the illusion of higher current profits. In August 2001, some of the company’s leaders started to worry about a possible accounting scandal due to this practice. In October 2001, the Securities and Exchange Commission began investigating Enron’s transactions. This was the event that led the company to bankruptcy, which began in December 2001.

Source: Britannica, “Enron scandal”.

Project aims

The principal aim of this project is to explore Enron’s email data set to extract insights about the fiscal fraud investigation and the bankruptcy of the company in 2001. For that we have 3 data sets:

  • the employee list with their email address

  • the emails exchanged from 1999 to 2002

  • the recipients of each email (to, cc, bcc).

The insights are made available in a Shiny app.

For this project we used several libraries:

  • for data exploration, analysis, and visualization: tidyverse, circlize, wordcloud, ggpubr, patchwork, gridExtra, grid, gtable, ggbreak

  • to display the results in the R Markdown report: knitr

  • to create the Shiny app: shiny

#library
library(tidyverse)
library(circlize)
library(wordcloud)
library(ggpubr)
library(patchwork)
library(gridExtra)
library(grid)
library(gtable)
library(ggbreak)
library(knitr)
library(shiny)

#dataset
load(file = "C:/Users/marie/Documents/DSTI_Cours/R_big_Data/Exam/Enron_project/Enron.Rdata")
#function to extract the legend of a plot
get_legend <- function(p, #the plot whose legend is shared by the plots arranged on the same layout
                       nrow=2 #the number of rows on which the legend is displayed, 2 by default
                       ){
  
  #override the guides to control the number of rows in the legend
  p_wrapped <- p + guides(
    #control how the legend entries are arranged
    fill = guide_legend(nrow = nrow, byrow = TRUE),
    color = guide_legend(nrow = nrow, byrow = TRUE))
  
  #convert the plot into a table of graphical components (grobs)
  temp <- ggplotGrob(p_wrapped)
  
  #extract the legend component(s), named "guide-box", as a list
  legend <- temp$grobs[which(sapply(temp$grobs, function(x) x$name) == "guide-box")]
  
  #return only the first legend, not the whole list
  return(legend[[1]])
} 

Data exploration and cleaning

First look at the data

The aim of this part is to see:

  • which kind of data the different tables contain

  • whether there are missing values and how to handle them

employee dataset

Description of the data set variables and dimension:

dim_employee <- dim(employeelist)

summary(employeelist)
##       eid          firstName           lastName           Email_id        
##  Min.   :  1.00   Length:149         Length:149         Length:149        
##  1st Qu.: 38.00   Class :character   Class :character   Class :character  
##  Median : 75.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 75.07                                                           
##  3rd Qu.:112.00                                                           
##  Max.   :150.00                                                           
##                                                                           
##     Email2             Email3             EMail4             folder         
##  Length:149         Length:149         Length:149         Length:149        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##             status  
##  Employee      :41  
##  N/A           :31  
##  Vice President:23  
##  Director      :14  
##  Manager       :14  
##  (Other)       :25  
##  NA's          : 1

This data set contains 149 rows and 9 columns.

This data set contains the employee ID (eid), the first and last name of each employee, their status, their email addresses, and the folder where their emails are stored. The status variable contains missing values identified by R (NA) but also missing values entered directly in the data by the data set owner, written as N/A. The eid variable is identified as numeric, status as a factor, and the other variables as character.

Display of some observations in the data frame:

kable(employeelist[1:10, ])
eid firstName lastName Email_id Email2 Email3 EMail4 folder status
13 Marie Heard heard-m NA
6 Mark Taylor taylor-m Employee
19 Lindy Donoho donoho-l Employee
115 Lisa Gang gang-l N/A
129 Jeffrey Skilling skilling-j CEO
18 Lynn Blair blair-l Director
33 Kim Ward ward-k N/A
149 Kate Symes symes-k Employee
52 Kay Mann mann-k Employee
21 Keith Holst holst-k Director

By looking at the head of the data, we observe that eid is stored as a numeric variable, but the more appropriate type seems to be factor because it is an employee ID. In addition, the variables Email2, Email3, and EMail4 contain a lot of blanks.

To investigate the blanks, we temporarily change the data type of those variables from character to factor to see how the summary reports the blank observations.

kable(employeelist %>% transform(
  Email2 = as.factor(Email2),
  Email3 = as.factor(Email3),
  EMail4 = as.factor(EMail4)
) %>% summary())
eid firstName lastName Email_id Email2 Email3 EMail4 folder status
Min. : 1.00 Length:149 Length:149 Length:149 :52 :100 :147 Length:149 Employee :41
1st Qu.: 38.00 Class :character Class :character Class :character a..shankman@enron.com : 1 a..martin@enron.com : 1 j..kean@enron.com : 1 Class :character N/A :31
Median : 75.00 Mode :character Mode :character Mode :character : 1 : 1 : 1 Mode :character Vice President:23
Mean : 75.07 NA NA NA : 1 : 1 NA NA Director :14
3rd Qu.:112.00 NA NA NA b..sanders@enron.com : 1 : 1 NA NA Manager :14
Max. :150.00 NA NA NA : 1 : 1 NA NA (Other) :25
NA NA NA NA (Other) :92 (Other) : 44 NA NA NA’s : 1

We can see that the Email2, Email3, and EMail4 variables have no missing values in the R sense: their empty entries are blank strings. In Email3 and EMail4 more than half of the values are blank (100 and 147 out of 149, respectively), so those variables may not be very helpful for the analysis. In the status variable, missing values are declared in two different ways: 31 values are written N/A and only 1 is a real NA. For that variable we need to replace the N/A values with real NA values to homogenize the data.
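The three kinds of "empty" entries described above can also be counted directly. The following is a minimal sketch on an invented toy data frame; `toy_employee` and its values are assumptions for illustration, not rows from employeelist.

```r
# Toy stand-in for employeelist; all values are invented for illustration
toy_employee <- data.frame(
  Email3 = c("", "someone@example.com", ""),
  status = factor(c("N/A", "Employee", NA)),
  stringsAsFactors = FALSE
)

# Blank strings are not NA, so they have to be counted explicitly
blank_count <- sum(toy_employee$Email3 == "")

# Real missing values, as recognised by R
na_count <- sum(is.na(toy_employee$status))

# "N/A" typed by hand is an ordinary factor level, not a missing value
pseudo_na_count <- sum(toy_employee$status == "N/A", na.rm = TRUE)

print(c(blank = blank_count, real_NA = na_count, typed_NA = pseudo_na_count))
```

Counting the blanks this way avoids the temporary conversion to factor, at the cost of one expression per variable.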

message data set

Description of the data set variables and dimension:

dim_message <- dim(message)

kable(summary(message))
mid sender date message_id subject
Min. : 52 : 6273 Min. :0001-05-30 : 1 Length:252759
1st Qu.: 88565 : 5838 1st Qu.:2000-12-01 : 1 Class :character
Median :186421 : 5100 Median :2001-05-21 : 1 Mode :character
Mean :190260 : 4797 Mean :1999-04-15 : 1 NA
3rd Qu.:279962 : 4437 3rd Qu.:2001-10-25 : 1 NA
Max. :404927 : 3686 Max. :2044-01-04 : 1 NA
NA (Other) :222628 NA (Other) :252753 NA

This data set contains 252759 rows and 5 columns.

Here we observe that the summary describes the mid and date variables with numeric statistics, that the sender and message_id variables are factors, and that the subject variable is of character type.

Display of some observations in the data frame:

kable(message[1:10, ])
mid sender date message_id subject
52 2000-01-21 ENRON HOSTS ANNUAL ANALYST CONFERENCE PROVIDES BUSINESS OVERVIEW AND GOALS FOR 2000
53 2000-01-24 Over $50 – You made it happen!
54 2000-01-24 Over $50 – You made it happen!
55 2000-02-02 ROAD-SHOW.COM Q4i.COM CHOOSE ENRON TO DELIVER FINANCIAL WEB CONTENT
56 2000-02-07 Fortune Most Admired Ranking
57 2000-08-25 WPTF Friday Credo Veritas Burrito
58 2000-06-21 SAP ID - Here it is!!!!!
59 2000-06-27 Set of Graphs
60 2000-07-25 Block Forward Financial Trades
61 2000-07-27 Block forwards

By looking at the head of the data, we observe that mid does not look like numeric data but rather like an identifier, similar to the eid variable in the employeelist table. In the data frame, the date variable is displayed as a date. Moreover, the values of the subject variable seem to be repeated several times, suggesting that they behave more like a categorical variable than like individual strings.

Because the summary describes the date variable with numeric statistics while the observations displayed above look like real dates, we check with the class() function whether R treats it correctly, that is, whether its data type is Date:

class(message$date) == "Date"
## [1] TRUE

The result confirms that R treats the date variable with the right data type, Date. It is not necessary to change the data type of this variable.
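A slightly more robust variant of this check is `inherits()`, which still works when an object carries several classes. A minimal sketch on an invented date vector (`toy_dates` is an assumption for illustration):

```r
# Invented dates standing in for message$date
toy_dates <- as.Date(c("2000-01-21", "2001-10-25"))

# TRUE when "Date" is among the classes of the object
is_date <- inherits(toy_dates, "Date")
print(is_date)

# summary() reports Min., Median, Mean, Max. for a Date vector too,
# which is why the summary table can look numeric at first sight
print(summary(toy_dates))
```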

In the date variable, the min and max values returned are strange dates. In the introduction we saw that the data cover the period between 1999 and 2002, and those values fall outside that period.

To understand what those values are, we filter the table to keep the rows whose year is before 1999 or after 2002:

kable(message %>% 
  select(date) %>% #keep the date variable
  mutate(year = format(date,"%Y")) %>% #extract the year from the date
  filter((year < 1999) | (year > 2002)) %>% #keep the value below and after the study's period
  group_by(year) %>% count()) #count the number of rows per date out of the study's period
year n
0001 205
0002 53
1979 6
1997 1
1998 85
2004 53
2007 1
2020 2
2043 1
2044 3

Filtering the strange dates shows that some are not plausible dates (years 0001 and 0002) and that the others fall outside the study’s period. Together they represent 410 values, which is less than 1% of the observations in the table.
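The filter above compares years stored as text. A minimal sketch of the same idea with the year converted to an integer, so that the comparison is numeric, could look as follows; `toy_dates` is an invented vector, not the message table.

```r
# Invented dates standing in for message$date
toy_dates <- as.Date(c("1979-12-31", "1999-05-07", "2001-10-25", "2044-01-04"))

# Convert the year to an integer so the comparison is numeric, not alphabetical
toy_year <- as.integer(format(toy_dates, "%Y"))

# TRUE for the dates inside the study period (1999 to 2002)
in_period <- toy_year >= 1999 & toy_year <= 2002
kept_dates <- toy_dates[in_period]

# Share of observations outside the study period, in percent
share_out <- 100 * mean(!in_period)

print(kept_dates)
print(share_out)
```

The integer conversion is a design choice of this sketch: it does not rely on every year being printed with four digits.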

The variables mid and message_id could be redundant. To verify that, we count the mid values per message_id to see whether a mid could be attached to several message_id values.

kable(message%>% select(mid, message_id) %>% #select only the variable we need.
  transform(mid = as.factor(mid)) %>% #transform the mid into factor data type.
  group_by(message_id) %>% 
  count(mid) %>% #count the number of mid per message_id group, create a n variable with the result.
  filter(n != 1)) #filter to get the rows with a value different than 1.
message_id mid n

This shows that each message_id is attached to one and only one mid, which confirms the redundancy of the 2 variables in the data frame. To lighten the data, we can keep only one of them in the data frame for the analysis.
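Note that count(mid) applied after group_by(message_id) counts the rows of each (message_id, mid) pair. Another way to check the one-to-one relation is to count the distinct values in each direction. A minimal base R sketch on an invented table (`toy_message` and its identifiers are assumptions for illustration):

```r
# Invented stand-in for the mid and message_id columns of the message table
toy_message <- data.frame(
  mid = c(52, 53, 54),
  message_id = c("<a@toy>", "<b@toy>", "<c@toy>"),
  stringsAsFactors = FALSE
)

# Number of distinct mid values attached to each message_id
mid_per_message_id <- tapply(toy_message$mid, toy_message$message_id,
                             function(x) length(unique(x)))

# Number of distinct message_id values attached to each mid
message_id_per_mid <- tapply(toy_message$message_id, toy_message$mid,
                             function(x) length(unique(x)))

# The two variables are redundant only if both directions always give 1
one_to_one <- all(mid_per_message_id == 1) && all(message_id_per_mid == 1)
print(one_to_one)
```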

As we saw in the table header, the sender variable holds the email address of the email’s sender. Those email addresses are also in the employeelist table, which gives the status in the company for most employees, but there they are split into 4 different variables. In addition, the variables Email3 and EMail4 contain a lot of blank values. To see how we can merge the two tables, we look at the correspondence between their email addresses.

#prepare one table per email variable of the employeelist to check which addresses are also in the sender variable
employee_merge1 <- employeelist %>% mutate(sender = Email_id) %>% select(sender)
employee_merge2 <- employeelist %>% mutate(sender = Email2) %>% select(sender)
employee_merge3 <- employeelist %>% mutate(sender = Email3) %>% select(sender)
employee_merge4 <- employeelist %>% mutate(sender = EMail4) %>% select(sender)

#to do the join only with the sender variable
message_merge <- message %>% select(sender)
#first between the sender in the message table and the Email_id in the employeelist
EmailID_sender1 <- inner_join(message_merge, employee_merge1, by = "sender")

EmailID_sender1 %>% count()
##        n
## 1 104766
#between the sender in the message table and the Email2 in the employeelist
EmailID_sender2 <- inner_join(message_merge, employee_merge2, by = "sender")

EmailID_sender2 %>% count()
##   n
## 1 0
#between the sender in the message table and the Email3 in the employeelist
EmailID_sender3 <- inner_join(message_merge, employee_merge3, by = "sender")

EmailID_sender3 %>% count()
##      n
## 1 1170
#between the sender in the message table and the EMail4 in the employeelist
EmailID_sender4 <- inner_join(message_merge, employee_merge4, by = "sender")

EmailID_sender4 %>% count()
##   n
## 1 0

By using inner_join we can see that, in the employeelist table, only the variables Email_id and Email3 contain email addresses that are also in the sender variable of the message table. If we want to attach the employee status to the sender email address, we need to merge on those two variables.
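The same comparison can be sketched without building intermediate tables, by counting how many sender values appear in each email variable. This is only an illustration on invented data; `toy_sender` and `toy_emails` are assumptions, not the real tables.

```r
# Invented sender addresses standing in for message$sender
toy_sender <- c("a@example.com", "b@example.com", "a@example.com", "z@example.com")

# Invented stand-in for the email variables of employeelist
toy_emails <- data.frame(
  Email_id = c("a@example.com", "b@example.com"),
  Email2   = c("", ""),
  Email3   = c("", "z@example.com"),
  stringsAsFactors = FALSE
)

# For each email variable, count the sender rows found in it (blank strings ignored)
matches <- sapply(toy_emails, function(col) sum(toy_sender %in% col[col != ""]))
print(matches)
```

Unlike inner_join, `%in%` counts each sender row at most once, so the two counts can differ if an address appears several times in the employee table.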

recipient info data set

Description of the data set variables and dimension:

dim_recipient <- dim(recipientinfo)

summary(recipientinfo)
##       rid               mid         rtype        
##  Min.   :     67   Min.   :    52   BCC: 253713  
##  1st Qu.: 718289   1st Qu.:105438   CC : 253735  
##  Median :1515296   Median :198263   TO :1556994  
##  Mean   :1543862   Mean   :196168                
##  3rd Qu.:2309682   3rd Qu.:280673                
##  Max.   :3242063   Max.   :404927                
##                                                  
##                        rvalue       
##  no.address@enron.com     :  19198  
##  jeff.dasovich@enron.com  :  11137  
##  richard.shapiro@enron.com:  11015  
##  steven.j.kean@enron.com  :  10873  
##  james.d.steffes@enron.com:  10615  
##  tana.jones@enron.com     :   9781  
##  (Other)                  :1991823

This data set contains 2064442 rows and 4 columns. The summary of the data reveals that rid and mid are considered numeric variables by R and that rtype and rvalue are considered factors.

Display of some observations in the data frame:

rid mid rtype rvalue
67 52 TO
68 53 TO
69 54 TO
70 55 TO
71 56 TO
72 56 TO
73 57 TO
74 58 TO
75 59 TO
76 60 TO

By looking at the head of this data set, we can see that rid and mid are identifiers; given the result returned by the summary function, we need to transform those variables into factors so that they have the appropriate type. The mid variable is also a foreign key that links this table to the message table. Joining these 2 tables gives us the sender and the receiver of each email as well as the type of receiver (direct with TO, or “indirect” with CC and BCC). The last variable, rvalue, is the email address of the receiver, which can be general (as seen in the head of the table) or specific to a person (e.g., jeff.dasovich@enron.com, the top person-specific receiver in the summary of that table). The person-specific email addresses in the rvalue variable can be looked up among the email addresses of the employeelist table to get each employee’s status in the company. We proceed as with the message table.

#prepare one table per email variable of the employeelist to check which addresses are also in the rvalue variable
employee_merge1 <- employeelist %>% mutate(rvalue = Email_id) %>% select(rvalue)
employee_merge2 <- employeelist %>% mutate(rvalue = Email2) %>% select(rvalue)
employee_merge3 <- employeelist %>% mutate(rvalue = Email3) %>% select(rvalue)
employee_merge4 <- employeelist %>% mutate(rvalue = EMail4) %>% select(rvalue)

#to do the join only with the rvalue variable
recipient_merge <- recipientinfo %>% select(rvalue)
#first between the rvalue in the recipient table and the Email_id in the employeelist
EmailID_recipient1 <- inner_join(recipient_merge, employee_merge1, by = "rvalue")

EmailID_recipient1 %>% count()
##        n
## 1 361234
# between the rvalue in the recipient table and the Email2 in the employeelist
EmailID_recipient2 <- inner_join(recipient_merge, employee_merge2, by = "rvalue")

EmailID_recipient2 %>% count()
##   n
## 1 0
#between the rvalue in the recipient table and the Email3 in the employeelist
EmailID_recipient3 <- inner_join(recipient_merge, employee_merge3, by = "rvalue")

EmailID_recipient3 %>% count()
##      n
## 1 2382
#between the rvalue in the recipient table and the EMail4 in the employeelist
EmailID_recipient4 <- inner_join(recipient_merge, employee_merge4, by = "rvalue")

EmailID_recipient4 %>% count()
##   n
## 1 0

As in the message table, we only have matches between rvalue and the Email_id and Email3 variables.

reference info data set

Description of the data set variables and dimension:

dim_reference <- dim(referenceinfo)

summary(referenceinfo)
##       rfid            mid          reference        
##  Min.   :    2   Min.   :    79   Length:54778      
##  1st Qu.:14305   1st Qu.: 60580   Class :character  
##  Median :30987   Median :178176   Mode  :character  
##  Mean   :30860   Mean   :179738                     
##  3rd Qu.:46728   3rd Qu.:275557                     
##  Max.   :63024   Max.   :404920

This data set contains 54778 rows and 3 columns.

The summary shows that the variables rfid and mid are numeric and that the reference variable is of character type.

Display of some observations in the data frame:

kable(referenceinfo[5:10, ])
rfid mid reference
5 14 845 From: Monaco, John [EM] [mailto:john.monaco@citi.com]Sent: Thursday, March 07, 2002 6:40 AMTo: Badeer, RobertSubject: FW: RE: Whats up!!!!!Still around!!!!—–Original Message—–From: [mailto:enron.mailsweeper.admin@enron.com] Sent: Thursday, March 07, 2002 9:36 AMTo: Monaco, John [EM]Subject: RE:RE: Whats up!!!!!The enron.com recipient(s) moved to a new organization. The new email address follows the (as per their original enron.comemail address). Email sent to recipient(s) at enron.com will not bedelivered.
6 15 846 From: Rangel, Ina Sent: Thursday, March 07, 2002 8:11 AMTo: Badeer, RobertSubject: Expense ReceiptsBob:I received your expense receipts today. Will submit them today.Ina Rangel
7 16 847 From: Grigsby, Mike Sent: Friday, March 08, 2002 9:08 AMTo: Badeer, RobertSubject: RE: BADGEGo with Ina —–Original Message—–From: Badeer, Robert Sent: Friday, March 08, 2002 11:08 AMTo: Grigsby, MikeSubject: RE: BADGEGrigs, Ina said it would be on the 5th floor of the new building. Which is right? —–Original Message—–From: Grigsby, Mike Sent: Friday, March 08, 2002 6:46 AMTo: Badeer, RobertSubject: BADGEYour badge will be waiting for you at the front desk in the north tower on mon. if not, then call and we will retrieve you.Michael D. Grigsby, Executive DirectorUBS Warburg Energy, LLCWork: 713-853-7031Mobile: 713-408-6256
8 17 848 From: Grigsby, Mike Sent: Friday, March 08, 2002 6:46 AMTo: Badeer, RobertSubject: BADGEYour badge will be waiting for you at the front desk in the north tower on mon. if not, then call and we will retrieve you.Michael D. Grigsby, Executive DirectorUBS Warburg Energy, LLCWork: 713-853-7031Mobile: 713-408-6256
9 18 849 From: Rangel, Ina Sent: Thursday, March 07, 2002 12:56 PMTo: Badeer, RobertSubject: FW: Badge AccessWhen you get here on Monday morning, come to the 5th floor reception of the new building. If your badge is not there, then I will come and pick you up when you get here and bring you up. Your badge will be ready Monday for sure, whether it be morning or afternoon I am not sure of.-Ina —–Original Message—–From: Curless, Amanda Sent: Thursday, March 07, 2002 2:50 PMTo: Rangel, InaSubject: RE: Badge AccessIna,We can most likely have this by Monday morning and he can pick this up at the 5th floor reception. If he has any problems he can call me. Thanks!Mandy —–Original Message—–From: Rangel, Ina Sent: Thursday, March 07, 2002 2:39 PMTo: Curless, AmandaSubject: RE: Badge Access << File: Badge Access Form.doc >> I filled out all of the information that I had on him. Will he be able to have his badge by Monday morning and where will he go to pick it up.Ina —–Original Message—–From: Curless, Amanda Sent: Thursday, March 07, 2002 2:00 PMTo: Rangel, InaSubject: Badge Access << File: Badge Access Form.doc >> Ina,Pleae fill out and return to me at ECS 05848. You can e-mail this to me if this is easier. Thanks!Mandy
10 19 851 From: Hyatt, Kevin Sent: Wednesday, July 25, 2001 1:00 PMTo: Nielsen, JeffSubject: RE: Mid 4 to Mid 3 QuoteJeff, can you fill in the rates for the 5,7, and 10 year terms for me. These would be notional of course. Let me know if you have questions.thxKevin 713-853-5559 Term/yrs. 2 5 7 10 Demand: Firm* $.02 - .03 $.04-.05 $.06-.07 $.07-.08 TI $.035 - .045 \(.065-\).075 $.075-.085 $.095-.105 Volume is min. 0 to max of 200,000/d * plus minimum commodity Primary to El Paso Waha would be slightly higher Rates are plus fuel —–Original Message—–From: Nielsen, Jeff Sent: Monday, July 23, 2001 4:39 PMTo: Hyatt, KevinSubject: Mid 4 to Mid 3 QuoteKevin,Jo Williams said that you needed a quote for transportation from Mid 4 to Mid 3 in the Waha area. On a firm basis we would be would in the $.02 to $.03 demand range plus minimum commodity. For a TI rate use between $.035 and $.045. If you would like primary to El Paso Waha, that rate would be a little higher. We have been able to get additional value out of that interconnect because of the gas prices in California. Please let me know if you need any additional information.Jeff 402-398-7434

By looking at the head of that table we can see that:

  • rfid and mid are not numeric variables but identifiers. Their data type needs to be changed to factor, which is better adapted.

  • the reference variable in the referenceinfo table holds the content of each message. The table also has the mid variable, which allows us to merge it with the message and/or the recipientinfo table.

  • the message and recipientinfo tables contain email addresses, as does the employeelist table. These tables can therefore be merged through the email addresses.

By exploring those data sets we identified some issues that need to be handled before the analysis: data type changes, missing value handling, variable redundancy, and data set merging.

We chose to:

  • Change the data type of the identifier variables in the different tables from numeric to factor.

  • Change the data type of the subject variable from character to factor.

  • Remove the message_id variable from the message table to lighten the data set. In addition, we drop the rows whose date falls outside the study’s period (from 1999 to 2002), including the strange dates.

  • Remove the Email2 and EMail4 variables from the employeelist table because they don’t match the email addresses in the message and recipientinfo tables.

  • Keep the referenceinfo table even though it isn’t exhaustive: it contains only 54,778 observations, which is about 2.7% of the size of the recipientinfo table. We can therefore analyse the content of only a small part of the email exchanges.

  • Create a table that gathers all the information about the messages by merging the message, referenceinfo, and recipientinfo tables through the mid foreign key.

  • Keep the NA values in the status of the sender and the receiver. This allows us to keep all the information about the exchanges; dropping them could lose information.

Data engineering and cleaning

Employeelist table

employeelist_2 <- employeelist %>% 
  select(-c(Email2, EMail4)) %>% #the variable we don't need in the data
  transform(eid = as.factor(eid)) %>% #data type change for the variable eid to factor
  mutate(status = if_else((status == "N/A"), NA, status)) #homogenized the declaration of the NA in the variable status

Description of the new table employee list:

summary(employeelist_2)
##       eid       firstName           lastName           Email_id        
##  1      :  1   Length:149         Length:149         Length:149        
##  2      :  1   Class :character   Class :character   Class :character  
##  3      :  1   Mode  :character   Mode  :character   Mode  :character  
##  4      :  1                                                           
##  5      :  1                                                           
##  6      :  1                                                           
##  (Other):143                                                           
##     Email3             folder                     status  
##  Length:149         Length:149         Employee      :41  
##  Class :character   Class :character   Vice President:23  
##  Mode  :character   Mode  :character   Director      :14  
##                                        Manager       :14  
##                                        Trader        :13  
##                                        (Other)       :12  
##                                        NA's          :32

Verification of the data type of the table variables:

#return the data type for every variable in the table
str(employeelist_2)
## 'data.frame':    149 obs. of  7 variables:
##  $ eid      : Factor w/ 149 levels "1","2","3","4",..: 13 6 19 115 129 18 33 148 52 21 ...
##  $ firstName: chr  "Marie" "Mark" "Lindy" "Lisa" ...
##  $ lastName : chr  "Heard" "Taylor" "Donoho" "Gang" ...
##  $ Email_id : chr  "marie.heard@enron.com" "mark.e.taylor@enron.com" "lindy.donoho@enron.com" "lisa.gang@enron.com" ...
##  $ Email3   : chr  "" "e.taylor@enron.com" "" "" ...
##  $ folder   : chr  "heard-m" "taylor-m" "donoho-l" "gang-l" ...
##  $ status   : Factor w/ 10 levels "CEO","Director",..: NA 3 3 NA 1 2 NA 3 3 2 ...

The results of the summary and str functions show that the data type change, the NA homogenization, and the removal of the variables were done correctly. We can now use this table to pursue the analysis.
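The homogenization of the missing values can also be sketched in base R. The example below uses an invented factor (`toy_status` is an assumption for illustration) and additionally drops the now unused N/A level, a step the pipeline above does not perform explicitly.

```r
# Invented status values: one hand-typed "N/A" and one real NA
toy_status <- factor(c("Employee", "N/A", NA, "CEO"))

# which() ignores the NA produced by comparing a real NA with "N/A"
toy_status[which(toy_status == "N/A")] <- NA

# Remove the "N/A" level, which no longer has any observation
toy_status_clean <- droplevels(toy_status)

print(levels(toy_status_clean))
print(sum(is.na(toy_status_clean)))
```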

message table

message_2 <- message %>%
  select(-c(message_id)) %>% #withdraw the variable we don't need
  transform(#change the data type for factor
    mid = as.factor(mid),
    sender = as.factor(sender),
    subject = as.factor(subject)) %>%
  #add the year variable in the table from the date
  mutate(year = as.factor(format(date, "%Y"))) %>% 
  #filter to keep only the dates from 1999 to 2002
  filter(year %in% c(1999 : 2002)) %>% 
  #drop the year variable, which is no longer useful in the data
  select(-year)

recipientinfo

recipientinfo_2 <- recipientinfo %>%
  #change the variable data type for factor
  transform(rid = as.factor(rid),
            rvalue = as.factor(rvalue),
    mid = as.factor(mid))

referenceinfo

referenceinfo_2 <- referenceinfo %>%
  #change the variable data type for factor
  transform(rfid = as.factor(rfid),
    mid = as.factor(mid))
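The df_message table used in the next steps is not built in the code shown in this report. The following is only a sketch of one way such a table could be assembled, assuming left joins on mid; the toy tables, their values, and the choice of join are assumptions, not the confirmed code of this project.

```r
library(dplyr)

# Toy stand-ins for message_2, recipientinfo_2 and referenceinfo_2 (invented values)
message_toy <- data.frame(
  mid = c("52", "53"),
  sender = c("a@example.com", "b@example.com"),
  stringsAsFactors = FALSE
)
recipient_toy <- data.frame(
  rid = c("67", "68", "69"),
  mid = c("52", "52", "53"),
  rtype = c("TO", "CC", "TO"),
  rvalue = c("x@example.com", "y@example.com", "z@example.com"),
  stringsAsFactors = FALSE
)
reference_toy <- data.frame(
  rfid = "14",
  mid = "52",
  reference = "toy email content",
  stringsAsFactors = FALSE
)

# Assumption: one row per recipient, with the email content attached when it exists
df_message_toy <- message_toy %>%
  left_join(recipient_toy, by = "mid") %>%
  left_join(reference_toy, by = "mid")

print(df_message_toy)
```

With left joins, a message sent to several recipients appears on several rows, and messages without stored content keep NA in the reference variable.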

Merging the employee status with the df_message table

First we do it for the sender with Email_id

#prepare the employeelist table for the merge
employee_merge_final <- employeelist_2 %>% 
  select(Email_id, status) %>% #keep only the variables we need
  mutate(status_sender = status) %>% #rename the status variable to know to who is attached the status
  select(-status)

#merge with the df_message table 
df_message_status <- left_join(df_message, employee_merge_final, 
                               join_by(sender == Email_id))

#check that the merge worked
df_message_status %>% filter(!is.na(status_sender)) %>% count()
##        n
## 1 294291

Then we do it for the sender with Email3

#prepare the employeelist table for the merge
employee_merge_final2 <- employeelist_2 %>% 
  select(Email3, status) %>% #keep only the variables we need
  mutate(status_sender_email3 = status) %>% #rename the status variable to know to who is attached the status
  select(-status)

#merge with the df_message_status table 
df_message_status <- left_join(df_message_status, employee_merge_final2, 
                               join_by(sender == Email3))

#check that the merge worked
df_message_status %>% filter(!is.na(status_sender_email3)) %>% count()
##      n
## 1 2034

Group all the sender statuses into one variable

df_message_status <- df_message_status %>% mutate(
  #replace the NA values in status_sender by the value found through Email3
  status_sender = if_else(is.na(status_sender), status_sender_email3, status_sender)) %>% 
  select(-status_sender_email3) #drop the helper variable

#check that the merge worked
df_message_status %>% filter(!is.na(status_sender)) %>% count()
##        n
## 1 296325

With this operation we attached an employee status to the sender in 296 325 rows. Next we do the same for the recipient.
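The two-step lookup (main address first, then the alternative address) can be summarised on a small example. This is a minimal sketch on invented data, using coalesce() instead of if_else(); the table names, addresses, and statuses are assumptions for illustration.

```r
library(dplyr)

# Invented messages: one main address, one alternative address, one unknown sender
msgs <- data.frame(
  sender = c("a@example.com", "alias.b@example.com", "x@example.com"),
  stringsAsFactors = FALSE
)

# Invented employee table with a main and an alternative address
emp <- data.frame(
  Email_id = c("a@example.com", "b@example.com"),
  Email3   = c("", "alias.b@example.com"),
  status   = c("CEO", "Trader"),
  stringsAsFactors = FALSE
)

msgs_status <- msgs %>%
  # first lookup: status through the main address
  left_join(emp %>% select(Email_id, status_main = status),
            by = c("sender" = "Email_id")) %>%
  # second lookup: status through the alternative address, blanks excluded
  left_join(emp %>% filter(Email3 != "") %>% select(Email3, status_alias = status),
            by = c("sender" = "Email3")) %>%
  # keep the first non-missing status of the two lookups
  mutate(status_sender = coalesce(status_main, status_alias)) %>%
  select(sender, status_sender)

print(msgs_status)
```

coalesce() returns the first non-missing value across its arguments, which matches the if_else() replacement used in the pipeline.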

First we do it for the recipient with Email_id

#prepare the employeelist table for the merge
employee_merge_final_recipient <- employeelist_2 %>% 
  select(Email_id, status) %>% #keep only the variables we need
  mutate(status_recipient = status) %>% #rename the status variable to know to who is attached the status
  select(-status)

#merge with the df_message_status table 
df_message_status <- left_join(df_message_status, employee_merge_final_recipient, 
                               join_by(rvalue == Email_id))

#check that the merge worked
df_message_status %>% filter(!is.na(status_recipient)) %>% count()
##        n
## 1 291737

Then we do it for the recipient with Email3

#prepare the employeelist table for the merge
employee_merge_final_recipient2 <- employeelist_2 %>% 
  select(Email3, status) %>% #keep only the variables we need
  mutate(status_recipient_email3 = status) %>% #rename the status variable to know to who is attached the status
  select(-status)

#merge with the df_message_status table 
df_message_status <- left_join(df_message_status, employee_merge_final_recipient2, 
                               join_by(rvalue == Email3))

#check that the merge worked
df_message_status %>% filter(!is.na(status_recipient_email3)) %>% count()
##      n
## 1 2382

Group all the recipient statuses into one variable

df_message_status <- df_message_status %>% mutate(
  #replace the NA values in status_recipient by the value found through Email3
  status_recipient = if_else(is.na(status_recipient), status_recipient_email3, status_recipient)) %>% 
  select(-status_recipient_email3) #drop the helper variable

#check that the merge worked
df_message_status %>% filter(!is.na(status_recipient)) %>% count()
##        n
## 1 294119

By doing this we identified the status of the recipient in 294 119 rows.

Now that all the information we need is grouped in the same data frame, we look at the period covered by the email content in the reference variable.

start <- df_message %>% filter(!is.na(reference)) %>% select(date) %>%
  arrange(date) %>% head(n=1)


end <- df_message %>% filter(!is.na(reference)) %>% select(date) %>%
  arrange(desc(date)) %>% head(n=1)

length_email_content <- df_message %>% filter(!is.na(reference)) %>% count()

We have 268524 rows with email content; the first message is dated 1999-05-07 and the last 2002-07-12. We can thus analyse the content of part of the messages exchanged between Enron employees over this period.
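The start date, end date, and number of rows can also be obtained in one pass with range(). A minimal base R sketch on an invented table (`toy_message` and its values are assumptions for illustration):

```r
# Invented stand-in for df_message: one row has no stored content
toy_message <- data.frame(
  date = as.Date(c("1999-05-07", "2000-06-21", "2002-07-12", "2001-01-01")),
  reference = c("toy content", NA, "toy content", "toy content"),
  stringsAsFactors = FALSE
)

# Keep only the rows for which the email content is available
with_content <- toy_message[!is.na(toy_message$reference), ]

# range() returns the earliest and the latest date as a vector of length 2
period <- range(with_content$date)
n_content <- nrow(with_content)

print(period)
print(n_content)
```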

To facilitate the analysis and lighten the data frame, we remove the identifier columns, which are no longer useful to us, and rename the rvalue variable to recipient to make it more meaningful.

df_message_status <- df_message_status %>% 
  #drop the identifier variables
  select(-c(mid, rfid, rid)) %>%
  #rename the recipient email variable
  rename(recipient = rvalue) %>%
  #order the variables
  select(date, sender, status_sender, rtype, recipient, status_recipient, subject, reference)
#remove the objects that are no longer needed from the environment
rm(employeelist, message, message_2, recipientinfo, recipientinfo_2, referenceinfo, referenceinfo_2, df_message_missing, message_merge, recipient_merge, EmailID_sender1, EmailID_sender2, EmailID_sender3, EmailID_sender4, EmailID_recipient1, EmailID_recipient2, EmailID_recipient3, EmailID_recipient4, employee_merge1, employee_merge2, employee_merge3, employee_merge4, end, start, length_email_content, employee_merge_final, employee_merge_final2, employee_merge_final_recipient, employee_merge_final_recipient2, dim_employee, dim_message, dim_recipient, dim_reference)

Data analysis

#in this part we draw many plots; all of them use the same theme
theme_set(theme_light())

We start by drawing a global picture of the cleaned data.

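#number of distinct direct ("TO") email exchanges
Emailcount <- count(df_message_status %>% filter(rtype == "TO") %>% distinct(sender, recipient, subject, reference))
#number of distinct exchanges whose subject starts with "RE:" (note: not restricted to rtype == "TO")
Reply <- count(df_message_status %>% filter(str_detect(subject, "^RE:")) %>% distinct(sender, recipient, subject, reference))
#number of distinct exchanges where the status of the sender or the recipient is known
emailExchangeStatus <- count(df_message_status %>% distinct(sender, status_sender, recipient, status_recipient, subject, reference) %>% filter(!is.na(status_sender)|!is.na(status_recipient)))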

In this data set we have 17502 senders and 68065 recipients. The large difference between the number of senders and recipients suggests that a single email often involves several people. We have 908177 distinct direct email exchanges, of which 9.82 % are replies to an earlier email. This suggests that most emails are one-way information, sent or received, and that few are real conversations between workers; at that time, workers may have communicated through other channels such as the telephone. Moreover, we know the Enron status of the sender or the recipient for only 31.44 % of the email exchanges, which suggests that many emails come from external sources and/or from workers with an unidentified status. It is also possible that some emails are addressed to mailing lists that group several employees of the company; for those we cannot know the status of the workers.
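
The reply share is presumably obtained by dividing the `Reply` count by the `Emailcount` count. The sketch below reproduces that kind of calculation on invented subjects. It also shows that the pattern `"^RE:"` is case-sensitive, so a subject starting with "Re:" is not counted; whether this matters for the Enron subjects is not checked here.

```r
library(dplyr)
library(stringr)

# Toy data (invented subjects)
toy <- tibble(
  subject = c("Meeting", "RE: Meeting", "Re: Budget", "Budget")
)

reply_share <- toy %>%
  summarise(
    # share of subjects starting with "RE:" exactly (pattern used in the report)
    pct_strict   = 100 * mean(str_detect(subject, "^RE:")),
    # same share when upper and lower case are both accepted
    pct_any_case = 100 * mean(str_detect(subject, regex("^re:", ignore_case = TRUE)))
  )

reply_share
```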

enronEmailAdd <- count(df_message_status %>% filter((str_detect(sender,"@enron")) | (str_detect(recipient,"@enron"))) %>% distinct(sender, recipient))
Estimation_generalEmailAdd <- count(df_message_status %>% 
                                      #keywords regularly used in general email address names, searched in the sender or recipient variable
                                      filter(str_detect(sender, "^enron|^press|^office|^all|^announcement|^communications|affair|client|contact|secur|team|comit|^west|energy") |
                                               str_detect(recipient, "^enron|^press|^office|^all|^announcement|^communications|affair|client|contact|secur|team|comit|^west|energy")))
Exchange_ext_enron <- count(
  #extract the variable we need
  df_message_status %>% select(date, sender, recipient, subject, reference) %>% 
    #count for each the sender and recipient whose have an enron email address
    mutate(count_sender = if_else(str_detect(sender, "@enron"), 1, 0),
  count_recipient = if_else(str_detect(recipient, "@enron"), 1,0)) %>% 
    #for each date and subject for each date make the sum of the sender and recipient with an enron email address
    group_by(date, subject) %>% mutate(
      sum_sender = sum(count_sender),
      sum_recipient = sum(count_recipient)) %>% ungroup() %>%
    #isolate the email exchange which not involved person with an enron email address
    filter((sum_sender ==0) & (sum_recipient == 0)))

In our data set, 256100 distinct sender-recipient pairs involve at least one Enron email address. Enron owned many subsidiaries with their own email domains, which could explain why only about 30% of the email addresses in those exchanges have an Enron domain. We also estimate that about 63879 emails are sent from or addressed to a general email address covering several workers at Enron or at one of its subsidiaries. Finally, 25212 emails are both sent by and addressed to people without an Enron domain in their email address. Those exchanges represent about 1% of the total emails in the data set.
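
The idea behind `Exchange_ext_enron` is to keep the groups of rows (same date and subject) in which no sender and no recipient has an Enron address. A compact sketch of that idea with a grouped `filter()` and `any()` is shown below; the addresses are invented, and the simple `"@enron"` pattern is the one used in the report.

```r
library(dplyr)
library(stringr)

# Toy data (invented addresses)
toy <- tibble(
  date      = as.Date(c("2001-01-02", "2001-01-02", "2001-01-03")),
  subject   = c("Gas prices", "Gas prices", "Lunch"),
  sender    = c("x@aol.com", "y@enron.com", "x@aol.com"),
  recipient = c("y@enron.com", "x@aol.com", "z@yahoo.com")
)

external_only <- toy %>%
  group_by(date, subject) %>%
  # keep a group only when none of its senders or recipients has an Enron address
  filter(!any(str_detect(sender, "@enron")),
         !any(str_detect(recipient, "@enron"))) %>%
  ungroup()

external_only
```

Here only the "Lunch" row is kept, because the "Gas prices" group contains an Enron address on both sides.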

#count the number of email address without enron domain for the sender
c1 <- df_message_status %>% distinct(sender) %>% mutate(
  count_tot_sender = n(),
  count_ext_sender = if_else((!str_detect(sender, "@enron")), 1, 0),
  #count_ext_recipient = if_else((!str_detect(recipient, "@enron")), 1, 0),
  sum_ext_sender = sum(count_ext_sender),
  pct_ext_sender = paste0(round((sum_ext_sender/count_tot_sender)*100), "%")
  #sum_ext_recipient = sum(count_ext_recipient)
  ) %>% distinct(sum_ext_sender, pct_ext_sender)

#count the number of email address without enron domain for the recipient
c2 <- df_message_status %>% distinct(recipient) %>% mutate(
  count_tot_recipient = n(),
  count_ext_recipient = if_else((!str_detect(recipient, "@enron")), 1, 0),
  sum_ext_recipient = sum(count_ext_recipient),
  pct_ext_recipient = paste0(round((sum_ext_recipient/count_tot_recipient)*100), "%")
  ) %>% distinct(sum_ext_recipient, pct_ext_recipient)

#bind the both count in the same dataframe
cbind(c1, c2)
##   sum_ext_sender pct_ext_sender sum_ext_recipient pct_ext_recipient
## 1          11457            65%             39321               58%

This highlights that more than half of the senders and recipients do not have an email address with an Enron domain. This suggests that many email exchanges take place between Enron workers and the company's clients. It is also possible that emails are sent from or addressed to the personal email addresses of Enron workers, perhaps for informal exchanges.

With this first picture of the data we can deduce that:

  • The data set is not exhaustive, neither for the status of the employees in the company nor for the content of the emails.

  • Many exchanges involve external addresses. However, most exchanges involve at least one Enron worker: fewer than 10% of the emails are both sent from and addressed to addresses without an Enron domain.

  • Few emails seem to be real conversations between employees, since few emails contain “RE:” in their subject.

  • A small part of the email exchanges seems to take place between people who are external to Enron. Their share of the total data set is negligible, so we keep them in the data set for the rest of the analysis.

We therefore decide to carry out the analysis including the workers without a status, so as not to lose any information about the email exchanges, and to keep the external email addresses.

The employee list

To explore the number of employees per status, we use the employeelist_2 data frame.

Number of employees per status:

employeelist_2 %>% select(status) %>% #select the needed variable
  group_by(status) %>% count() %>% #count the number of employee per status
  ungroup() %>%
  #calculate the percentage for each status
  mutate(perc = `n`/sum(`n`),
  labels = scales::percent(perc)) %>%
  #bar chart
  ggplot(aes(reorder(status, perc ,sum),perc, fill = status)) +
  geom_bar(stat = "identity") +
  #to invert the axis's position
  coord_flip()+ 
  #customize the theme, title and axis labels
  geom_text(aes(label = labels), vjust = 0.5, size = 4) + #display the percentage for each category at the end of the corresponding bar
  scale_y_continuous(labels = scales::percent_format())+
  ggtitle("Number of employee per status in Enron company")+
  labs(y = "Percentage (%)",x = "Employee status") +
  scale_fill_brewer(palette = "Set3", 
                    #to display the NA in grey on the graph
                    na.value = "grey50"
                    )+
  theme(legend.position = "none")

The bar chart above shows that:

  • most employees have an employee or unknown status (27.48% and 21.48% respectively)

  • there are few lawyers (less than 1% of the total number of employees)

  • surprisingly, many employees have a vice president status (about 15% of the employees)

  • there is a similar number of managers, directors, and traders in the company (about 9% each)

  • at the head of the company there are several CEOs, presidents, and managing directors (about 2% each)
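
The percentages read from the bar chart come from dividing the count of each status by the total number of rows in the employee list. A minimal sketch of that arithmetic on an invented four-person list (the statuses are illustrative only):

```r
library(dplyr)

# Toy employee list (invented); NA stands for an unknown status
toy <- tibble(status = c("Employee", "Employee", "CEO", NA))

status_share <- toy %>%
  count(status) %>%             # one row per status, NA kept as its own row
  mutate(perc = n / sum(n))     # share of each status in the whole list

status_share
```

Because `NA` is kept as its own row, the unknown statuses count towards the total, as in the chart.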

Next we look at the email exchanges over the study period. First we extract the month and the year from the date and store them in separate variables.

df_message_status <- df_message_status %>% 
  mutate(year = format(date,"%Y"), #extract the year from the date
         month = format(date, "%m")) %>% #extract the month from the date 
  transform( #to put the variables in the right type
    year = as.factor(year),
    month = as.factor(month))
df_message_status %>% group_by(year,month)%>%
  count() %>%
  ggplot(aes(month, n, group = year, color = year))+
  geom_line(linewidth = 1)+ #linewidth replaces size for lines since ggplot2 3.4.0
  scale_y_continuous(labels = scales::label_comma())+
  labs(title = "Number of emails sent/received per month by Enron's workers",
       x = "Month",
       y = "Email count per month")+
  #the lines are mapped to color, so the color scale (not the fill scale) applies
  scale_color_brewer(palette = "Set3")

The plot above shows that:

  • In 1999 the number of emails exchanged is low. We find the same level in April 2002.

  • Over the year 2000 the number of emails exchanged between Enron's workers increases gradually and reaches its highest level in November 2000.

  • In 2001 we see a peak of email exchanges in April and May, the period when the fiscal fraud started to be discovered. The number of exchanges then decreases during the summer and peaks again in October, when the company came under SEC investigation.

  • The email exchanges stop in May 2002, possibly the date when the company was completely closed. At the start of 2002 (January and February) we still see a high number of email exchanges, perhaps due to the completion of the fiscal fraud investigation and its consequences for the company.

Description of the number of emails sent and received

First of all, in the df_message_status table we count the distinct email addresses of the senders and recipients, as well as how often they appear in the table:

#count the number of distinct sender email addresses
sender_count <- df_message_status %>% select(sender) %>% #keep only the variable we need
  distinct(sender) %>% #keep each email address only once
  count() #count them
#count the number of distinct recipient email addresses
recipient_count <- df_message_status %>% select(recipient) %>% distinct(recipient) %>% count()

In the df_message_status table there are 68065 different recipient email addresses and 17502 different sender email addresses. The large difference between them suggests that one email is often addressed to several people.
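
Both counts can be produced in one step with `n_distinct()`. The sketch below uses an invented example of a single sender writing to three recipients, which is the situation the paragraph above suggests is common.

```r
library(dplyr)

# Toy data (invented): one sender, three different recipients
toy <- tibble(
  sender    = c("a@enron.com", "a@enron.com", "a@enron.com"),
  recipient = c("b@enron.com", "c@enron.com", "d@enron.com")
)

address_counts <- toy %>%
  summarise(
    n_sender    = n_distinct(sender),     # number of different sender addresses
    n_recipient = n_distinct(recipient)   # number of different recipient addresses
  )

address_counts
```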

To picture which type of Enron worker is the most active in the email exchanges, we look at the number of emails sent and received by each status and compare them.

We start with the emails sent.

#compute the number of emails sent per day per employee status
violin_worker <- df_message_status %>% filter(!is.na(status_sender)) %>%
  group_by(date, status_sender) %>%
  summarise(email_count = n(), .groups = "drop")

#violin plot 
ggplot(violin_worker, aes(as.factor(status_sender), email_count, fill = as.factor(status_sender))) +
  geom_violin(trim = FALSE) +
  geom_boxplot(width = 0.1, outlier.shape = NA, color = "white")+
  #note: ylim() drops the daily counts above 250 before the violins, boxplots and anova are computed
  ylim(c(0,250))+
  stat_compare_means(method = "anova", label.y = 250, size = 4)+
  labs(title = "Comparison of the number of emails sent per day by Enron worker status",
       x = "Source",
       y = "Email count per day") +
  theme(legend.position = "none")

The plot above shows that employees send the highest number of emails in the company. The anova test shows that the difference between groups is significant.

Table with the descriptive statistics for each group:

#descriptive statistics between the worker status group
violin_worker %>% group_by(status_sender)%>%
  summarise(
    mean = mean(email_count),
    sd = sd(email_count),
    min = min(email_count),
    Q1 = quantile(email_count, 0.25),
    Q3 = quantile(email_count, 0.75),
    max = max(email_count)
  )
## # A tibble: 9 × 7
##   status_sender       mean     sd   min    Q1    Q3   max
##   <fct>              <dbl>  <dbl> <int> <dbl> <dbl> <int>
## 1 CEO                37.7  284.       1     3  17    4740
## 2 Director           27.7   41.4      1     3  39     298
## 3 Employee          159.   271.       1    13 186.   4085
## 4 In House Lawyer     7.29   7.12     1     2   9      35
## 5 Manager            47.9   69.0      1    11  62    1044
## 6 Managing Director  10.7   32.2      1     2   8     455
## 7 President          29.6   75.5      1     3  26     988
## 8 Trader             17.6   24.0      1     4  23     307
## 9 Vice President     74.5  116.       1    12  89.8  1014
#statistical comparison between group
pairwise.t.test(violin_worker$email_count, violin_worker$status_sender, 
                #adjust the p.value with bonferroni because the number of group is small
                p.adjust.method = "bonferroni")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  violin_worker$email_count and violin_worker$status_sender 
## 
##                   CEO     Director Employee In House Lawyer Manager
## Director          1.000   -        -        -               -      
## Employee          < 2e-16 < 2e-16  -        -               -      
## In House Lawyer   1.000   1.000    < 2e-16  -               -      
## Manager           1.000   1.000    < 2e-16  0.154           -      
## Managing Director 1.000   1.000    < 2e-16  1.000           0.017  
## President         1.000   1.000    < 2e-16  1.000           1.000  
## Trader            1.000   1.000    < 2e-16  1.000           0.032  
## Vice President    0.022   7.0e-05  < 2e-16  5.7e-05         0.047  
##                   Managing Director President Trader 
## Director          -                 -         -      
## Employee          -                 -         -      
## In House Lawyer   -                 -         -      
## Manager           -                 -         -      
## Managing Director -                 -         -      
## President         1.000             -         -      
## Trader            1.000             1.000     -      
## Vice President    2.5e-08           8.3e-05   2.9e-09
## 
## P value adjustment method: bonferroni

The tables above describe the number of emails sent per day for each status and compare the groups pairwise. This confirms the first observations from the violin plot:

  • Employees send significantly more emails per day on average than any other status. Employees are also the largest group of workers in the company, which may influence the result.

  • After them, vice presidents and managers send the highest number of emails per day. This may be related to their roles in the company.

However, as pointed out previously, employees are the largest group in the company. To confirm that they are the most active group in sending emails, we normalise the number of emails sent per day in each group by the number of distinct senders in that group on that day.
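
The normalisation amounts to dividing, for each day and status, the number of emails sent by the number of distinct senders. A minimal sketch on invented rows, using `summarise()` to get one row per day and status (the report's own code keeps one row per email instead):

```r
library(dplyr)

# Toy data (invented): two CEO senders on one day, one on the next
toy <- tibble(
  date          = as.Date(c("2001-08-23", "2001-08-23", "2001-08-23", "2001-08-24")),
  status_sender = c("CEO", "CEO", "CEO", "CEO"),
  sender        = c("a@enron.com", "a@enron.com", "b@enron.com", "a@enron.com")
)

ratio_per_day <- toy %>%
  group_by(date, status_sender) %>%
  summarise(
    nb_send          = n(),                 # emails sent by the group that day
    nb_sender_per_gp = n_distinct(sender),  # distinct senders in the group that day
    ratio            = nb_send / nb_sender_per_gp,
    .groups = "drop"
  )

ratio_per_day
```

On the first day three emails come from two senders (ratio 1.5); on the second, one email comes from one sender (ratio 1).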

#Filter to keep only the workers with a known status
df_message_status %>% filter(!is.na(status_sender)) %>%
  group_by(date, status_sender) %>% 
  #count the number of emails sent per day per group as well as the number of distinct senders in each group on that date
  mutate(
    nb_send = n(),#for each status, the total number of emails sent on that date
    nb_sender_per_gp = n_distinct(sender) #for each status, the number of distinct sender email addresses on that date
  ) %>% ungroup()%>% 
  #ratio between the emails sent per day for each status and the number of distinct senders in that status for that day
  mutate(ratio_nb_email_pctg = nb_send/nb_sender_per_gp) %>%
  #violin box plot
  ggplot(aes(status_sender, ratio_nb_email_pctg, fill = status_sender)) +
  geom_violin(trim = FALSE)+
  geom_boxplot(width = 0.1, outlier.shape = NA, color = "white")+
  labs(title = "Comparison of the email send in function of the Enron's worker statuts.",
       subtitle = "Ratio to the number of worker per group.",
       x = "Source",
       y = "Ratio email per status")+
  theme(legend.position = "none")

Once the number of emails sent per day is normalised by the number of senders, the ratio is generally low, perhaps between 0 and 10 up to the first quartile. Surprisingly, it is the CEO group that sends the highest number of emails per day on average, which contradicts what we observed previously with the raw number of emails per day per status. However, the violin plot suggests a large gap between the lowest and the highest number of emails sent per day for this group. The average may be pushed up by a few extreme values.

df_message_status %>% filter(!is.na(status_sender)) %>%
  group_by(date, status_sender) %>% mutate(
  nb_send = n(),
  nb_sender_per_gp = n_distinct(sender)) %>% ungroup()%>% 
  mutate(ratio_nb_email_pctg = nb_send/nb_sender_per_gp) %>%
  distinct(date,status_sender, sender, nb_send, nb_sender_per_gp, ratio_nb_email_pctg) %>% 
  group_by(status_sender)%>% summarise(
    mean = mean(ratio_nb_email_pctg),
    median = median(ratio_nb_email_pctg),
    sd = sd(ratio_nb_email_pctg),
    min = min(ratio_nb_email_pctg),
    Q1 = quantile(ratio_nb_email_pctg, 0.25),
    Q3 = quantile(ratio_nb_email_pctg, 0.75),
    max = max(ratio_nb_email_pctg)
  )
## # A tibble: 9 × 8
##   status_sender      mean median     sd   min    Q1    Q3   max
##   <fct>             <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 CEO               32.0    7    189.       1  3    15    2370 
## 2 Director          12.0    7     15.6      1  3    14.5   194 
## 3 Employee          23.7   16.1   25.9      1 10.7  25.7   348 
## 4 In House Lawyer    7.29   5      7.12     1  2     9      35 
## 5 Manager           11.3    8.43  13.0      1  5.17 13.2   201.
## 6 Managing Director  9.96   3.5   25.7      1  2     7.5   228.
## 7 President         20.3    9     59.0      1  3    18     988 
## 8 Trader             7.59   5      8.03     1  2.67  9.12   81 
## 9 Vice President    15.3   11.2   14.9      1  6.8  18.3   206

After normalising the number of emails sent by the number of senders in the group, the CEO average is around 32 emails per day with a median of 7, while the employee average is around 23 with a median of 16. This suggests that the CEO average is pushed up by a few extreme values: the maximum is 2,370 for the CEO group against 348 for the Employee group. This could be why the CEO group appears to send the highest number of emails per day. To understand this extreme value, we look for the date linked to it.
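
The effect of one extreme day on the mean, compared with the median, can be illustrated with a few invented ratios. Only the value 2370 is taken from the table above; the other values are made up.

```r
# Invented daily ratios: mostly small values plus one extreme day
daily_ratio <- c(3, 5, 7, 9, 15, 2370)

mean(daily_ratio)    # 401.5: pulled up by the single extreme value
median(daily_ratio)  # 8: barely affected by it
```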

To understand what happened, we look closely at the CEO group and isolate the day with the highest ratio of emails sent.

df_message_status %>% filter(!is.na(status_sender)) %>%
  group_by(date, status_sender) %>% mutate(
  nb_send = n(),
  nb_sender_per_gp = n_distinct(sender)) %>% ungroup()%>% 
  mutate(ratio_nb_email_pctg = nb_send/nb_sender_per_gp) %>%
  filter(status_sender == "CEO") %>% 
  distinct(date,status_sender, sender, nb_send, nb_sender_per_gp, ratio_nb_email_pctg) %>% 
  filter(ratio_nb_email_pctg == 2370) #compare with a number rather than a character string
## # A tibble: 2 × 6
##   date       status_sender sender   nb_send nb_sender_per_gp ratio_nb_email_pctg
##   <date>     <fct>         <chr>      <int>            <int>               <dbl>
## 1 2001-08-23 CEO           kenneth…    4740                2                2370
## 2 2001-08-23 CEO           david.w…    4740                2                2370

Indeed, the maximum number of emails sent by the CEO group was reached on 23 August 2001, the period when the heads of the company started to worry that the fiscal fraud could be discovered by the authorities.

#environment cleaning
rm(jeff_stat, sender_stat, statuts_stat, p1, p2, p3, p4, violin_plot, violin_plot1, violin_plot2, violin_worker)

Now we look at the emails received by each Enron worker status.

#compute the number of emails received per day per employee status
violin_worker <- df_message_status %>%   filter(!is.na(status_recipient)) %>%
  group_by(date, status_recipient) %>%
  summarise(email_count = n(), .groups = "drop")

#violin plot 
ggplot(violin_worker, aes(as.factor(status_recipient), email_count, fill = as.factor(status_recipient))) +
  geom_violin(trim = FALSE) +
  geom_boxplot(width = 0.1, outlier.shape = NA, color = "white")+
  ylim(c(0,250))+
  labs(title = "Comparison of the number of emails received per day by Enron worker status",
       x = "Source",
       y = "Email count per day") +
  theme(legend.position = "none")

Employees, managers, and vice presidents seem to be the groups of workers at Enron who receive the highest number of emails. In-house lawyers seem to receive the fewest emails per day. Several differences between groups are significant.

Descriptive statistics and comparison between groups:

#descriptive statistics between the worker status groups
violin_worker %>% group_by(status_recipient)%>%
  summarise(
    mean = mean(email_count),
    median = median(email_count),
    sd = sd(email_count),
    min = min(email_count),
    Q1 = quantile(email_count, 0.25),
    Q3 = quantile(email_count, 0.75),
    max = max(email_count)
  )
## # A tibble: 9 × 8
##   status_recipient   mean median     sd   min    Q1    Q3   max
##   <fct>             <dbl>  <dbl>  <dbl> <int> <dbl> <dbl> <int>
## 1 CEO               11.6       6  15.3      1     2  15     197
## 2 Director          35.6      18  61.7      1     5  38     676
## 3 Employee          98.6      40 156.       1     7 122.   1333
## 4 In House Lawyer    5.64      3   8.14     1     1   6.5    62
## 5 Manager           42.2      28  53.1      1    10  55     438
## 6 Managing Director 18.0       6  30.4      1     2  18     178
## 7 President         22.9      10  32.4      1     3  29     224
## 8 Trader            39.8      12  70.6      1     3  42     538
## 9 Vice President    85.8      32 130.       1     7 122.   1140
#statistical comparison between group
pairwise.t.test(violin_worker$email_count, violin_worker$status_recipient, 
                #adjust the p.value with bonferroni because the number of group is small
                p.adjust.method = "bonferroni")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  violin_worker$email_count and violin_worker$status_recipient 
## 
##                   CEO     Director Employee In House Lawyer Manager
## Director          9.4e-05 -        -        -               -      
## Employee          < 2e-16 < 2e-16  -        -               -      
## In House Lawyer   1.00000 5.9e-05  < 2e-16  -               -      
## Manager           2.4e-08 1.00000  < 2e-16  8.7e-08         -      
## Managing Director 1.00000 0.01940  < 2e-16  1.00000         3.3e-05
## President         0.86132 0.35860  < 2e-16  0.18185         0.00190
## Trader            9.8e-07 1.00000  < 2e-16  1.5e-06         1.00000
## Vice President    < 2e-16 < 2e-16  0.06459  < 2e-16         < 2e-16
##                   Managing Director President Trader 
## Director          -                 -         -      
## Employee          -                 -         -      
## In House Lawyer   -                 -         -      
## Manager           -                 -         -      
## Managing Director -                 -         -      
## President         1.00000           -         -      
## Trader            0.00058           0.02020   -      
## Vice President    < 2e-16           < 2e-16   < 2e-16
## 
## P value adjustment method: bonferroni

Again, employees receive the highest number of emails per day. They show the highest mean, but it is close to that of the vice presidents. In addition, the standard deviation of these two groups is large and their distributions may overlap, which explains why the number of emails received per day by the employee group is not significantly higher than that of the vice president group. The employee group is the largest in the company (27% of the workers), whereas vice presidents represent only about 15% of the workers. Vice presidents may receive a high number of emails because of their position in the company. The manager group also receives one of the highest numbers of emails per day, perhaps, as for the vice presidents, because of their position. After those groups come the traders and the directors, who also receive a high number of emails per day.

As for the emails sent, we check whether these results hold when we normalise the number of emails received per day in each group by the number of workers in that group.

#Filter to keep only the workers with a known status
df_message_status %>% filter(!is.na(status_recipient)) %>%
  #note: the groups are defined by date and status_sender, not by status_recipient
  group_by(date, status_sender) %>% 
  #count the number of emails received per day per group as well as the number of distinct recipients in each group on that date
  mutate(nb_received = n(),
  nb_received_per_gp = n_distinct(recipient)) %>% 
  ungroup()%>% 
  #ratio between the emails received per day for each group and the number of distinct recipients in that group for that day
  mutate(ratio_nb_email_pctg = nb_received/nb_received_per_gp) %>%
  #violin box plot
  ggplot(aes(status_recipient, ratio_nb_email_pctg, fill = status_recipient)) +
  geom_violin(trim = FALSE)+
  geom_boxplot(width = 0.1, outlier.shape = NA, color = "white")+
  labs(title = "Comparison of the email received in function of the Enron's worker statuts.",
       subtitle = "Ratio to the number of worker per group.",
       x = "Source",
       y = "Ratio email per status")+
  theme(legend.position = "none")

df_message_status %>% filter(!is.na(status_recipient)) %>%
  group_by(date, status_sender) %>% 
  mutate(nb_received = n(),
  nb_received_per_gp = n_distinct(recipient)) %>% 
  ungroup()%>% 
  mutate(ratio_nb_email_pctg = nb_received/nb_received_per_gp)%>%
  #keep only distinct value
  distinct(date,status_recipient, recipient, nb_received, nb_received_per_gp, ratio_nb_email_pctg) %>% 
  #make the descriptive statistics for each recipient group
  group_by(status_recipient)%>% summarise(
    mean = mean(ratio_nb_email_pctg),
    median = median(ratio_nb_email_pctg),
    sd = sd(ratio_nb_email_pctg),
    min = min(ratio_nb_email_pctg),
    Q1 = quantile(ratio_nb_email_pctg, 0.25),
    Q3 = quantile(ratio_nb_email_pctg, 0.75),
    max = max(ratio_nb_email_pctg)
  )
## # A tibble: 9 × 8
##   status_recipient   mean median    sd   min    Q1    Q3   max
##   <fct>             <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 CEO                5.69   4.35  5.02     1  2.86  6.52  67.8
## 2 Director           6.54   4.81  5.70     1  3.33  7.83  48.7
## 3 Employee           6.19   4.56  5.60     1  3.04  7.13  67.8
## 4 In House Lawyer    6.99   5.32  5.88     1  3.71  8.25  40.9
## 5 Manager            6.26   4.76  5.38     1  3.2   7.29  67.8
## 6 Managing Director  6.35   4.46  6.32     1  2.74  7.28  67.8
## 7 President          5.34   4.12  4.79     1  2.48  6.25  56.1
## 8 Trader             7.17   5.27  6.55     1  3.51  8.43  67.8
## 9 Vice President     5.53   4.17  4.99     1  2.67  6.41  67.8

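When the number of emails received is normalised by the number of workers in each group, there is no real difference between the groups. Note that the ratio above is computed within groups defined by date and status_sender, so it may not isolate each recipient status, which could contribute to the similarity between groups. We may also suppose that, in each group, more workers receive emails than send them each day.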
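
The chunks above group by `status_sender` before counting recipients. As a hedged variant, the sketch below groups by `status_recipient` instead, so that each ratio refers to one recipient status. It runs on invented rows only; how this change would affect the table reported above is not known.

```r
library(dplyr)

# Toy data (invented): four emails received on the same day
toy <- tibble(
  date             = as.Date(rep("2001-10-01", 4)),
  status_sender    = c("CEO", "CEO", "Trader", "Trader"),
  status_recipient = c("Employee", "Employee", "Employee", "Manager"),
  recipient        = c("e1@enron.com", "e1@enron.com", "e2@enron.com", "m1@enron.com")
)

received_ratio <- toy %>%
  filter(!is.na(status_recipient)) %>%
  group_by(date, status_recipient) %>%      # group by the recipient's status
  summarise(
    nb_received         = n(),                    # emails received by the group that day
    nb_recipient_per_gp = n_distinct(recipient),  # distinct recipients in the group that day
    ratio               = nb_received / nb_recipient_per_gp,
    .groups = "drop"
  )

received_ratio
```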

#count the number of distinct senders and recipients per day and per status
send_vs_received <- df_message_status %>% 
  group_by(date, status_sender) %>% 
  mutate(nb_sender_per_group = n_distinct(sender)) %>% ungroup()%>%
  group_by(date, status_recipient) %>% 
  mutate(nb_recipient_per_group = n_distinct(recipient)) %>% ungroup()

send_vs_received <- as.data.frame(send_vs_received)
  
#descriptive statistic for both the sender and recipient
send_vs_received %>% 
  summarise(
    across(c(nb_sender_per_group,nb_recipient_per_group),
           list(mean = ~mean(.x),
                median = ~median(.x),
                sd = ~sd(.x),
                min = ~min(.x),
                Q1 = ~quantile(.x,0.25),
                Q3 = ~quantile(.x,0.75),
                max = ~max(.x))))
##   nb_sender_per_group_mean nb_sender_per_group_median nb_sender_per_group_sd
## 1                 206.8242                        159               185.2247
##   nb_sender_per_group_min nb_sender_per_group_Q1 nb_sender_per_group_Q3
## 1                       1                     80                    281
##   nb_sender_per_group_max nb_recipient_per_group_mean
## 1                    1328                    1249.528
##   nb_recipient_per_group_median nb_recipient_per_group_sd
## 1                          1168                  849.0907
##   nb_recipient_per_group_min nb_recipient_per_group_Q1
## 1                          1                       618
##   nb_recipient_per_group_Q3 nb_recipient_per_group_max
## 1                      1930                       3156
#violin plots to visualise the descriptive statistics
p1 <- send_vs_received %>% filter(!is.na(status_sender)) %>%
  ggplot(aes(status_sender, nb_sender_per_group, fill = status_sender))+
  geom_violin(trim = FALSE)+
  geom_boxplot(width = 0.1, outlier.shape = NA, color = "white")+
  labs(title = "Comparison of the sender in function of the Enron's worker statuts.",
       x = "Source",
       y = "Number of sender per status")+
  theme(legend.position = "none")

p2 <- send_vs_received %>% filter(!is.na(status_recipient)) %>%
  ggplot(aes(status_recipient, nb_recipient_per_group, fill = status_recipient))+
  geom_violin(trim = FALSE)+
  geom_boxplot(width = 0.1, outlier.shape = NA, color = "white")+
  labs(title = "Comparison of the recipient in function of the Enron's worker statuts.",
       x = "Source",
       y = "Number of recipient per status")+
  theme(legend.position = "none")

p1/p2

We can see that, on average, more people in a group receive emails each day than send them. This is especially true for the workers in the employee, trader, vice president, and director groups.

From all of this we can deduce that the most active Enron worker in the email exchanges seems to be Jeff Dasovich. In general, employees are the most active in the email exchanges. When we normalise the number of emails sent by the number of workers per group, employees remain the most active senders, but at some points the CEO group sends a high number of emails because of events at Enron. When we look at the number of emails received relative to the number of workers in a group, we see no real difference between groups, which suggests that more people receive emails each day than send them.

Next we look at the flow of email exchanges between the different statuses over the study period to see whether it changes.

For each year, we draw a chord diagram, which shows the links between the groups of Enron workers with a known status.
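
A chord diagram is typically drawn from a table with one row per (sender group, recipient group) pair and the number of exchanges between them. The sketch below builds such a table with `count()` on invented rows, dropping exchanges within the same group; the column name `sum` is chosen only to resemble the per-year tables in the report.

```r
library(dplyr)

# Toy data (invented): five exchanges between two groups
toy <- tibble(
  status_sender    = c("Employee", "Employee", "Trader", "Trader", "Employee"),
  status_recipient = c("Trader", "Trader", "Employee", "Trader", "Employee")
)

flows <- toy %>%
  filter(status_sender != status_recipient) %>%         # drop within-group exchanges
  count(status_sender, status_recipient, name = "sum")  # exchanges per ordered pair

flows
```

The direction matters: Employee to Trader and Trader to Employee are counted as separate flows.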

#for each year, follow the exchanges between groups
per_year <- df_message_status %>% select(date, status_sender, status_recipient) %>%
  filter(!is.na(status_sender) & !is.na(status_recipient)) %>%
  mutate(year = format(date,"%Y"),
         #to improve clarity, we group statuses with a similar level of responsibility together
         status_sender = case_when(
           status_sender %in% c("Managing Director", "Manager", "Director") ~ "Manger - Director",
           status_sender %in% c("CEO", "Vice President", "President") ~ "CEO - President",
           .default = status_sender),
         status_recipient = case_when(
           status_recipient %in% c("Managing Director", "Manager", "Director") ~ "Manger - Director",
           status_recipient %in% c("CEO", "Vice President", "President") ~ "CEO - President",
           .default = status_recipient)) %>%
  group_by(date,status_sender, status_recipient) %>%
  mutate(number_exchange = n()) %>% ungroup() %>%
  distinct(date, status_sender, status_recipient, number_exchange, year)

year_1999 <- as.data.frame(per_year %>% filter(year == 1999) %>%
  group_by(status_sender, status_recipient) %>%
  mutate(sum = sum(number_exchange)) %>% ungroup() %>%
  distinct(status_sender, status_recipient, sum) %>%
    filter(status_sender != status_recipient) %>%
    arrange(status_sender, status_recipient)
)

year_2000 <- as.data.frame(per_year %>% filter(year == 2000) %>%
  group_by(status_sender, status_recipient) %>%
  mutate(sum = sum(number_exchange)) %>% ungroup() %>%
  distinct(status_sender, status_recipient, sum) %>%
    filter(status_sender != status_recipient) %>%
    arrange(status_sender, status_recipient)
)

year_2001 <- as.data.frame(per_year %>% filter(year == 2001) %>%
  group_by(status_sender, status_recipient) %>%
  mutate(sum = sum(number_exchange)) %>% ungroup() %>%
  distinct(status_sender, status_recipient, sum) %>%
    filter(status_sender != status_recipient) %>%
    arrange(status_sender, status_recipient)
)

year_2002 <- as.data.frame(per_year %>% filter(year == 2002) %>%
  group_by(status_sender, status_recipient) %>%
  mutate(sum = sum(number_exchange)) %>% ungroup() %>%
  distinct(status_sender, status_recipient, sum) %>%
    filter(status_sender != status_recipient) %>%
    arrange(status_sender, status_recipient)
)

#the color for each status
status_color <- c(
  "Employee" = "pink",
  "CEO - President" = "orange",
  "Trader" = "springgreen3",
  "Manger - Director" = "violetred4",
  "In House Lawyer" = "purple4")

Display the chord diagram of the year 1999

adjacencyData_99 <-with(year_1999, table(status_sender, status_recipient))
chordDiagram(adjacencyData_99, transparency = 0.5, grid.col = status_color)

year 2000

adjacencyData_00 <-with(year_2000, table(status_sender, status_recipient))
chordDiagram(adjacencyData_00, transparency = 0.5, grid.col = status_color)

year 2001

adjacencyData_01 <-with(year_2001, table(status_sender, status_recipient))
chordDiagram(adjacencyData_01, transparency = 0.5, grid.col = status_color)

year 2002

adjacencyData_02 <-with(year_2002, table(status_sender, status_recipient))
chordDiagram(adjacencyData_02, transparency = 0.5, grid.col = status_color)

For the email exchange we can see that:

  • In 1999 the traders exchange only with employees, but afterwards they also exchange with managers/directors and CEO/presidents. Surprisingly, the traders never seem to exchange with the in-house lawyers. Maybe their email exchanges are indirect.

  • In 2002 the in-house lawyers received emails only from the managers/directors. In this period we do not see any email flow from the in-house lawyers to other workers with a known status. Maybe they sent emails to external people to manage the bankruptcy of the company, using the information received from the managers and directors.

  • In 2000 the in-house lawyers exchange only with the managers/directors and the CEO/presidents, but in 2001 they also exchange with employees. Maybe this change in the email flow is related to the Enron events, where employees may have needed to be informed about some matters so that they could answer the SEC investigations.

This last analysis highlights the changes in the email flow over the study period. Some of these changes could be linked to the Enron events.
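
A note on how the matrices behind the chord diagrams are built: `table()` counts rows, so on a data frame that holds one row per sender/recipient pair, every existing pair gets the value 1 and the diagram shows which groups are linked rather than how many emails flow between them. If the link width should reflect the volume stored in the `sum` column, a weighted cross-tabulation with `xtabs()` is one option. The sketch below uses a hypothetical data frame `pairs` with the same three columns as the yearly tables; it is an illustration under that assumption, not the method used for the figures above.

```r
# Toy data (hypothetical): one row per sender/recipient pair with its yearly total
pairs <- data.frame(
  status_sender    = c("Employee", "Employee", "Trader"),
  status_recipient = c("Trader", "CEO - President", "Employee"),
  sum              = c(120, 15, 40)
)

# table() counts rows: every existing pair gets the value 1
presence <- with(pairs, table(status_sender, status_recipient))

# xtabs() sums the `sum` column: every pair gets its email volume
volume <- xtabs(sum ~ status_sender + status_recipient, data = pairs)

presence["Employee", "Trader"]  # 1
volume["Employee", "Trader"]    # 120
```

Both objects have the same shape, so the choice only changes what the width of a link means: presence of an exchange, or number of emails exchanged.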

The number of emails sent/received per month over the years.

The data set we have covers the email exchange between Enron workers from 1999 to 2002. From 1999 to early 2001 the company was in good health. From the middle of 2001 the fraud committed by the company started to become public and put the company in trouble. Through the email history we will look at whether the number of emails sent/received changes over the months depending on the worker status.

We look, for each month of each year, at which worker statuses are the most active. We start with the emails sent.

#list of status in the Enron company
status_list <- c("Employee", "CEO", "Manager", "Director", "Vice President", "Trader", "President", "Managing Director", "In House Lawyer")

month_label <- c("01" = "January","02" = "February","03" = "March","04" = "April","05" = "May","06" = "June","07" = "July","08" = "August",
               "09" = "September","10" = "October","11" = "November","12" = "December")

month_color <- c("01" = "lightgreen","02" = "lightsalmon4","03" = "lightblue","04" = "greenyellow","05" = "cyan","06" = "darkgreen","07" = "lavender",
               "08" = "plum","09" = "coral","10" = "honeydew4","11" = "hotpink","12" = "indianred")

#initiate the list for the plot
email_send <- list()

#loop building, for each worker status, a bar plot of the number of emails sent per month
for(i in seq_along(status_list)){
  
  status <- status_list[i]
  
  p <- df_message_status %>% filter(status_sender == status) %>% #take the value in the list
    group_by(year, month) %>%
    count() %>% 
    #bar plot
    ggplot(aes(month, n, fill = month))+
    geom_bar(stat = "identity") +
    facet_grid(~year)+
    labs(title = paste("Emails sent per month for each year by the", status),
         y = "Email count per month")+
    scale_fill_manual(
      values = month_color,
      labels = month_label)+
    theme(legend.position = "bottom",
          axis.text.x = element_blank(),
          axis.ticks.x = element_blank(),
          axis.title.x = element_blank())
  
  email_send[[i]] <- p
}

#display the plot create
n <- length(email_send)

plot_per_section <- 3

for(j in seq(1,n,by=plot_per_section)){
  
  plot_on_the_page <- email_send[j:min(j + plot_per_section - 1, n)]
  
  #extract the legend from the first plot on the layout
  legend <- get_legend(plot_on_the_page[[1]], nrow = 2)
  
  #remove the legend for all plot on the layout
  no_legend <- lapply(plot_on_the_page, function(p) p + theme(legend.position = "none"))
  
  #arrange the plots of the page on 2 columns
  grid_plot <- arrangeGrob(grobs = no_legend, ncol = 2)
  
  #combine together the plots and one legend
  plots_with_legend <- arrangeGrob(
    grid_plot,
    legend,
    nrow = 2,
    #arrange the plot and the legend in the layout
    heights = unit.c(unit(1,"npc") - unit(7, "lines"), unit(7,"lines"))
  )
  
  #display everything together
  grid.newpage()
  grid.draw(plots_with_legend)
  
}

Looking year by year we can see that:

  • The workers with an Employee status send the highest number of emails in every year. The number of emails they send follows the trend we observed for all Enron workers, which suggests that the employees drive the overall number of emails exchanged per month in the company. That could be linked to how numerous they are in the company. In 2001 the Employee group was the one that sent the highest number of emails.

  • The CEO appears in the emails sent from January 2000, which is the moment his role is formally declared in the company. He sends a high number of emails compared to the Director and Managing Director groups. In 2001 in particular, he sends a large number of emails in April, May, October, and November. Maybe this is related to the fiscal fraud investigation.

  • The number of emails sent by the in-house lawyers is higher in 2001 than in the other years, which suggests they were involved in managing the fiscal fraud investigation inside the company.

  • The traders are the 3rd group in terms of emails sent per month, which is consistent with the company's activity.

Now we look at the emails received depending on the Enron worker status.

#initiate the list for the plot
email_received <- list()

#loop building, for each worker status, a bar plot of the number of emails received per month
for(i in seq_along(status_list)){
  
  status <- status_list[i]
  
  p <- df_message_status %>% filter(status_recipient == status) %>% #take the value in the list
    group_by(year, month) %>%
    count() %>% 
    #bar plot
    ggplot(aes(month, n, fill = month))+
    geom_bar(stat = "identity") +
    facet_grid(~year)+
    labs(title = paste("Emails received per month for each year by the", status),
         y = "Email count per month")+
    scale_fill_manual(
      values = month_color,
      labels = month_label)+
    theme(legend.position = "bottom",
          axis.text.x = element_blank(),
          axis.ticks.x = element_blank(),
          axis.title.x = element_blank())
  
  email_received[[i]] <- p
}

#display the plot create
n <- length(email_received)

plot_per_section <- 3

for(j in seq(1,n,by=plot_per_section)){
  
  plot_on_the_page <- email_received[j:min(j + plot_per_section - 1, n)]
  
  #extract the legend from the first plot on the layout
  legend <- get_legend(plot_on_the_page[[1]], nrow = 2)
  
  #remove the legend for all plot on the layout
  no_legend <- lapply(plot_on_the_page, function(p) p + theme(legend.position = "none"))
  
  #arrange the plots of the page on 2 columns
  grid_plot <- arrangeGrob(grobs = no_legend, ncol = 2)
  
  #combine together the plots and one legend
  plots_with_legend <- arrangeGrob(
    grid_plot,
    legend,
    nrow = 2,
    #arrange the plot and the legend in the layout
    heights = unit.c(unit(1,"npc") - unit(7, "lines"), unit(7,"lines"))
  )
  
  #display everything together
  grid.newpage()
  grid.draw(plots_with_legend)
  
}

The plots above show that:

  • As for the emails sent, the employees receive the highest number. They follow the same trend as for the emails sent, which suggests they are active in the email exchange in general.

  • The traders seem to receive more emails than they send.

  • For the groups at the head of the company (CEO, Managing Director, Director, President, and Vice President), the number of emails received follows the Enron fiscal fraud events, with high peaks in April, May, October, and November 2001.

  • In 2001, the Vice President group receives a lot of emails compared to the other groups at the head of the company.

  • The in-house lawyer group seems to receive its highest number of emails in 2001.
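
The paging used to display both series of plots (3 plots per page) can be checked independently of the plots themselves. The sketch below only reproduces the index arithmetic; `n = 9` matches the length of `status_list`, and `pages` is a hypothetical name introduced for the illustration.

```r
# Setup mirroring the display loops: 9 plots shown 3 per page
n <- 9
plot_per_section <- 3

# First index of each page
starts <- seq(1, n, by = plot_per_section)

# Indices shown on each page; min() keeps the last page inside 1..n
pages <- lapply(starts, function(j) j:min(j + plot_per_section - 1, n))
pages
```

Writing the upper bound as `j + plot_per_section - 1` keeps the page size tied to a single variable, so changing `plot_per_section` is enough to change the layout.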

#environment cleaning
rm(jeff_stat, recipient_stat, statuts_stat, violin_plot, violin_plot1, violin_plot2, violin_worker, p1, p2, send_vs_received)

Now we try to see who is the most active in the email exchange. For that, we start by counting the number of emails sent by each worker and return the 10 people who sent the highest number.

#Display the top 10 email address of sender
p1 <- df_message_status %>% group_by(sender)%>% count() %>% #to count the number of email send per email address
  ungroup() %>%
  #calculate the percentage for each sender
  mutate(perc = round(`n`/sum(`n`),3),
  labels = scales::percent(perc)) %>% 
  arrange(desc(n)) %>% head(10) %>% #to get only the 10 email address with the most important number of email send
  #bar chart
  ggplot(aes(reorder(sender, perc, sum), perc, fill = sender)) +
  geom_bar(stat="identity") +
  coord_flip() +
  #graph title and label
  geom_text(aes(label = labels), vjust = 0.5, size = 4) + #display the percentage for each category at the end of the corresponding bar
  scale_y_continuous(labels = scales::percent_format())+  
  labs(title = "Top 10 Enron's employee email sender")+
  xlab("Employee's email address")+
  ylab("Emails sent per sender (%)") +
  scale_fill_brewer(palette = "Set3")+
    theme(legend.position = "none",
        plot.margin = margin(10, 10, 10, 20))

#Display the top 10 email address of recipient
p2 <- df_message_status %>% filter(rtype == "TO") %>% #select only the email of the direct concerned receiver
  group_by(recipient)%>% count() %>% #to count the number of emails received per email address
  ungroup() %>%
  #calculate the percentage for each recipient
  mutate(perc = round(`n`/sum(`n`),4),
  labels = scales::percent(perc)) %>% 
  arrange(desc(n)) %>% head(10) %>% #to get only the 10 email addresses with the highest number of emails received
  #bar chart
  ggplot(aes(reorder(recipient, perc, sum), perc, fill = recipient)) +
  geom_bar(stat="identity") +
  coord_flip() +
  #graph title and label
  geom_text(aes(label = labels), vjust = 0.5, size = 4) + #display the percentage for each category at the end of the corresponding bar
  scale_y_continuous(labels = scales::percent_format())+ 
  labs(title = "Top 10 Enron's employee email receiver",
       subtitle = "Only principal receiver")+
  xlab("Employee's email address")+
  ylab("Emails received per recipient (%)") +
  scale_fill_brewer(palette = "Set3")+
  theme(legend.position = "none",
        plot.margin = margin(10, 10, 10, 20))

#arrange the plot on the same place
p1 / p2

Jeff Dasovich seems to be the most active Enron worker in the email exchange: over the study period he sent the highest proportion of emails (3.2%) and received the highest proportion (0.51%).
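
How these proportions are obtained can be shown on a small example. In the sketch below the vector `sender` is hypothetical; the point is that the share is computed against all emails before the top of the ranking is selected, so the percentages of the 10 bars are not expected to add up to 100%.

```r
# Toy data (hypothetical): one row per email sent
sender <- c("a@enron.com", "a@enron.com", "a@enron.com",
            "b@enron.com", "b@enron.com", "c@enron.com")

# Count emails per sender, then divide by the total to get each sender's share
counts <- table(sender)
share  <- round(as.numeric(counts) / sum(counts), 3)
names(share) <- names(counts)

# Keep the most active senders first (top 2 here, top 10 in the report)
head(sort(share, decreasing = TRUE), 2)
```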

#return only one result from that query to get the status of the most active sender/recipient
head(df_message_status[df_message_status$sender == "jeff.dasovich@enron.com", "status_sender"], 
     n=1)
## [1] Employee
## 10 Levels: CEO Director Employee In House Lawyer Manager ... Vice President

In the employee data set he is described as an Employee of Enron. To see if he is really the most active, we will compare the number of emails he sent and received to those of the other workers with the same status (Employee) and to those of all Enron workers.

Comparison of the number of emails sent by the worker who seems to be the most active (Jeff Dasovich), by all workers of his status (Employee), and by all Enron workers.

For that we will compute comparative descriptive statistics between them.

#count the emails sent by Jeff Dasovich
jeff_stat_send <- df_message_status %>% filter(sender == "jeff.dasovich@enron.com") %>%
  #one count per day and per email subject
  group_by(date, subject) %>% 
  summarise(email_count = n(), .groups = "drop") %>%
  mutate(source = "Jeff Dasovich") %>% transform(source = as.factor(source))

#count the emails sent by all Enron's workers
sender_stat <- df_message_status %>% 
  #one count per day, per sender, and per email subject
  group_by(date, sender, subject) %>% 
  summarise(email_count = n(), .groups = "drop") %>%
  mutate(source = "Enron's worker") %>% select(-sender) %>% transform(source = as.factor(source))

#count the emails sent by the workers with an Employee status
statuts_stat_send <- df_message_status %>% filter(status_sender == "Employee") %>% 
  #one count per day, per sender of status Employee, and per email subject
  group_by(date, sender, subject) %>% 
  summarise(email_count = n(), .groups = "drop") %>%
  mutate(source = "Employee status") %>% transform(source = as.factor(source))

#combine the rows together to create a unique dataframe and compared the enron's worker and the employee to Jeff
violin_plot1 <- bind_rows(jeff_stat_send, statuts_stat_send)
violin_plot2 <- bind_rows(jeff_stat_send, sender_stat)

#violin plot with a t.test comparing the 2 groups, to see if Jeff Dasovich is really more active than the other employees
p3 <- ggplot(violin_plot1, aes(as.factor(source), email_count, fill = as.factor(source))) +
  geom_violin(trim = FALSE) +
  geom_boxplot(width = 0.1, outlier.shape = NA, color = "white")+
  #display the comparative statistic on the violin plot
  stat_compare_means(method = "t.test", label.y = max(violin_plot1$email_count) - 400)+
  labs(title = "Comparison of the email count between\nJeff Dasovich and the Enron's Employee",
       x = "Source",
       y = "Email count per day") +
  #to better see the violin plot we break the y axis
  scale_y_break(c(100, 3000), scales = 0.3)+
  #set up the color for each resources
  scale_fill_manual(values = c(
      "Jeff Dasovich" = "tomato2",
      "Employee status" = "yellowgreen"))+
  #withdraw the legend form the plot
  theme(legend.position = "none")

#same plot but to compare Jeff Dasovich to the Enron's worker
p4 <- ggplot(violin_plot2, aes(as.factor(source), email_count, fill = as.factor(source))) +
  geom_violin(trim = FALSE) +
  geom_boxplot(width = 0.1, outlier.shape = NA, color = "white")+
  stat_compare_means(method = "t.test", label.y = max(violin_plot2$email_count) - 2000)+
  scale_y_break(c(250, 15000), scales = 0.3)+
  labs(title = "Comparison of the email count between\nJeff Dasovich and the Enron's worker",
       x = "Source",
       y = "Email count per day") +
  scale_fill_manual(#set up the color for each resources
    values = c(
      "Jeff Dasovich" = "tomato2",
      "Enron's worker" = "cyan"))+
  theme(legend.position = "none")

#arrange the plot on the same place
p3 + p4

#display the stat of the different group
violin_plot <- bind_rows(jeff_stat_send, sender_stat, statuts_stat_send)

violin_plot %>% group_by(source)%>%
  summarise(
    mean = mean(email_count),
    sd = sd(email_count),
    min = min(email_count),
    Q1 = quantile(email_count, 0.25),
    Q3 = quantile(email_count, 0.75),
    max = max(email_count)
  )
## # A tibble: 3 × 7
##   source           mean    sd   min    Q1    Q3   max
##   <fct>           <dbl> <dbl> <int> <dbl> <dbl> <int>
## 1 Jeff Dasovich   15.6   45.7     1     1     9   760
## 2 Enron's worker  10.6   80.6     1     1     5 18445
## 3 Employee status  5.49  29.9     1     1     3  3556

The table that summarises the emails sent by group shows us that:

  • Jeff Dasovich has the highest average number of emails sent per day (15.6). The lowest is for the Enron employees (5.49).

  • Looking at the quartiles Q1 and Q3, below which 25% and 75% of the values fall respectively, Jeff also has the highest third quartile (9), especially compared to the Enron employees (3).

  • Surprisingly, the highest number of emails sent in one day (18445) belongs to the Enron workers. Maybe that is linked to the Enron events.

From this we can deduce that Jeff Dasovich is significantly the most active Enron worker in email sending.
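
The comparison displayed on the violin plots is a two-sample t-test. A minimal sketch of the same kind of test with base R follows, on simulated counts (hypothetical data, not the Enron emails). `t.test()` applies the Welch correction by default, meaning the two variances are not assumed equal. Because the counts in the table above are strongly skewed (the maximum is far above Q3), a rank-based test is shown alongside as one possible robustness check; this is a suggestion, not part of the original analysis.

```r
# Toy data (hypothetical): daily email counts for one worker and for a group
set.seed(1)
one_worker <- rpois(60, lambda = 15)
group      <- rpois(300, lambda = 5)

# Welch two-sample t-test (default of t.test: variances not assumed equal)
welch <- t.test(one_worker, group)
welch$p.value

# Rank-based alternative, less sensitive to extreme counts
# (exact = FALSE because count data contain ties)
rank_test <- wilcox.test(one_worker, group, exact = FALSE)
rank_test$p.value
```

If both tests point in the same direction, the conclusion does not depend on a few extreme days.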

Then we look at the emails received by Jeff Dasovich compared to the Enron workers of the same status and to all Enron workers.

#statistics on the emails received per day by Jeff Dasovich
jeff_stat_rec <- df_message_status %>% filter(recipient == "jeff.dasovich@enron.com") %>%
  group_by(date) %>% 
  summarise(email_count = n(), .groups = "drop") %>%
  mutate(source = "Jeff Dasovich") %>% transform(source = as.factor(source))

#statistics on the emails received per day by each Enron's worker
recipient_stat <- df_message_status %>% group_by(date, recipient) %>% 
  summarise(email_count = n(), .groups = "drop") %>%
  mutate(source = "Enron's worker") %>% select(-recipient) %>% transform(source = as.factor(source))

#statistics on the emails received per day by the whole group of workers with an Employee status
statuts_stat_rec <- df_message_status %>% filter(status_recipient == "Employee") %>% group_by(date) %>% 
  summarise(email_count = n(), .groups = "drop") %>%
  mutate(source = "Employee status") %>% transform(source = as.factor(source))

#combine the rows together to create a unique dataframe and compared the enron's worker and the employee to Jeff
violin_plot1 <- bind_rows(jeff_stat_rec, statuts_stat_rec)
violin_plot2 <- bind_rows(jeff_stat_rec, recipient_stat)

#violin plot with a t.test comparing the 2 groups, to see if Jeff Dasovich is really more active than the other employees and/or workers in the Enron company
p3 <- ggplot(violin_plot1, aes(as.factor(source), email_count, fill = as.factor(source))) +
  geom_violin(trim = FALSE) +
  geom_boxplot(width = 0.1, outlier.shape = NA, color = "white")+
  #statistically compare the 2 groups to see if the difference is significant or not
  stat_compare_means(method = "t.test", label.y = max(violin_plot1$email_count) + 2)+
  labs(title = "Comparison of the email count between\nJeff Dasovich and the Enron's Employee",
       x = "Source",
       y = "Email count per day") +
  theme(legend.position = "none")+
  scale_fill_manual(#set up the color for each resources
    values = c(
      "Jeff Dasovich" = "tomato2",
      "Employee status" = "yellowgreen"
    ))

p4 <- ggplot(violin_plot2, aes(as.factor(source), email_count, fill = as.factor(source))) +
  geom_violin(trim = FALSE) +
  geom_boxplot(width = 0.1, outlier.shape = NA, color = "white")+
  ylim(c(-10,350))+
  stat_compare_means(method = "t.test", label.y = 300)+
  labs(title = "Comparison of the email count between\nJeff Dasovich and the Enron's worker",
       x = "Source",
       y = "Email count per day") +
  theme(legend.position = "none")+
  scale_fill_manual(#set up the color for each resources
    values = c(
      "Jeff Dasovich" = "tomato2",
      "Enron's worker" = "cyan"
    ))

#arrange the plot on the same place
p3 + p4

violin_plot <- bind_rows(jeff_stat_rec, recipient_stat, statuts_stat_rec)

violin_plot %>% group_by(source) %>%
  summarise(
    mean = mean(email_count),
    median = median(email_count),
    sd = sd(email_count),
    min = min(email_count),
    Q1 = quantile(email_count, 0.25),
    Q3 = quantile(email_count, 0.75),
    max = max(email_count)
  )
## # A tibble: 3 × 8
##   source           mean median     sd   min    Q1    Q3   max
##   <fct>           <dbl>  <dbl>  <dbl> <int> <dbl> <dbl> <int>
## 1 Jeff Dasovich   17.5      10  19.2      1     3   25    113
## 2 Enron's worker   3.19      2   6.36     1     1    3   1153
## 3 Employee status 98.6      40 156.       1     7  122.  1333

When we look at the number of emails received, Jeff Dasovich received significantly more emails on average than the Enron workers as a whole (17.5 against 3.19). However, compared to the other employees he did not receive more emails. On the contrary, he received significantly fewer (17.5 against 98.6). Note that in the code above the Employee series is counted per day for the whole group, not per recipient, so this comparison should be read with caution. For the Enron workers with an Employee status, the mean (98.6) is far from the median (40), which suggests that extreme values exist in that group. The violin of the employees highlights this: above the 3rd quartile there is a long tail which starts around 120 and becomes extremely thin after 250. On the contrary, the violin of Jeff Dasovich does not get thinner above the 3rd quartile and still seems to hold a substantial number of observations at these values. All of this suggests that some events made the employees receive an extremely high number of emails, and this peak is not seen for Jeff Dasovich.
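
Why a mean far above the median points to extreme values can be seen on a small example. The vector `daily` below is hypothetical: seven ordinary days and two extreme ones.

```r
# Toy data (hypothetical): daily counts with two extreme days
daily <- c(5, 7, 8, 10, 12, 9, 6, 800, 1200)

mean(daily)    # pulled up by the two extreme days
median(daily)  # depends only on the middle of the ordered values

# Without the two extreme days, mean and median are close again
usual <- daily[daily < 100]
c(mean = mean(usual), median = median(usual))
```

The median stays at the level of an ordinary day while the mean is dominated by the two extreme ones, which is the pattern observed for the Employee group.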

From this part of the analysis we can say that:

- Jeff Dasovich is the Enron worker who sent and received the highest number of emails.

- Compared to the other workers with an Employee status, he sent significantly more emails but received fewer.

- It is possible that some events made employees other than Jeff Dasovich receive more emails in one day. We could think that Jeff Dasovich is one of the employees who receive the most emails per day, but not the only one.

All of this suggests that Jeff Dasovich could be the most active worker in the email exchange of the Enron company.

Analysis of the email subject and content

In our data set we have 2063706 rows with email content, which represents 10%. This makes the email content much less exhaustive than the email subject, which is filled in for every email exchange.

String_var_stat <- df_message_status %>% distinct(reference, subject) %>% mutate(
  emailTextLength = str_length(reference),
  emailSubjectLength = str_length(subject)) 

summary(String_var_stat)
##   reference              subject       emailTextLength  emailSubjectLength
##  Length:157194      RE:      :  2744   Min.   :    0    Min.   :  0.00    
##  Class :character   FW:      :   585   1st Qu.:  505    1st Qu.: 17.00    
##  Mode  :character   RE: Hello:    82   Median : 1021    Median : 26.00    
##                     RE: Hi   :    56   Mean   : 1759    Mean   : 30.35    
##                              :    52   3rd Qu.: 2040    3rd Qu.: 40.00    
##                     RE: Lunch:    48   Max.   :65535    Max.   :255.00    
##                     (Other)  :153627   NA's   :110536

When we look at the distinct subjects and references we can see that:

  • On average an email text contains 1759 characters, and 75% of the email texts have 2040 characters or fewer, which suggests that the emails are short exchanges on focused subjects.

  • On average the email subject is 30 characters long, and 75% of the email subjects have no more than 40 characters.

To investigate the subject and text of the emails, we create 4 lists of topics which will be searched for in the email subject:

  • emails related to meetings, by looking for words such as message, please, email, inform.

  • emails related to the business processes and business legalities, such as enron, deal, change, corp, date, america.

  • emails related to the core business of Enron, like gas, power, trade.

  • emails related to the Enron events, such as bankrup, SEC, MTM, investigation.

These keywords come from the Wikipedia page about the timeline of the Enron downfall.

Each word/concept will be searched for individually in the email content to follow the email exchanges that contain it as well as the status of the Enron workers involved in those exchanges.

The analysis is carried out over the study period to highlight the periods where those topics/keywords are used more by the Enron workers. Then we will look at whether some worker statuses use them more than others, and finally look at some specific Enron workers known to be involved in the Enron events.

Search for the 4 topics in the email subjects as well as for keywords in the email content.

#topics list 

topic_meeting <- c("message|origin|pleas|email|thank|attach|file|copi|inform|receiv|thank|all|time|meet|look|week|day|dont|vinc|talk")

topic_business_process <- c("enron|deal|agreement|chang|contract|corp|fax|houston|date|america|risk|analy|confidential|correction")

topic_core_business <- c("market|gas|price|power|company|energy|trade|busi|servic|manag")

topic_enron_event <- c("bankrup|SEC|MTM|fear|losing money|10-K|fears|investigation|phone|fax|document")
#construction of the data set for measuring the frequency of the different topic in the email subject as well as the number of email with specific word, we focus on the sender status

email_subject_send <- df_message_status %>% distinct(date, year, month, sender, status_sender, subject, reference) %>%
  mutate(#count the number of email which contain at least one word in the list of each topic
    subject_meeting = if_else(str_detect(subject, topic_meeting), 1, 0),
    subject_business_process = if_else(str_detect(subject, topic_business_process), 1, 0),
    subject_core_business = if_else(str_detect(subject, topic_core_business), 1, 0),
    subject_enron_event = if_else(str_detect(subject, topic_enron_event), 1, 0),
    email_meeting = if_else(str_detect(reference,topic_meeting), 1, 0),
    email_business_process = if_else(str_detect(reference, topic_business_process), 1, 0),
    email_core_business = if_else(str_detect(reference, topic_core_business), 1, 0),
    email_enron_event = if_else(str_detect(reference, topic_enron_event), 1, 0),
    #to get the date in year/month
    year_month = as.Date(paste0(year,"-",month,"-01"))) 
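
Two properties of pattern matching are worth keeping in mind when reading the topic counts: matching is case-sensitive by default, and a pattern also matches inside longer words. The sketch below illustrates both with base R `grepl()` on hypothetical subjects; `str_detect()` follows the same two rules, and stringr's `regex(pattern, ignore_case = TRUE)` plays the role of `ignore.case`.

```r
# Toy subjects (hypothetical)
subject <- c("Meeting on Friday", "gas deal", "Call me", "SEC filing", "second draft")

# Case-sensitive by default: "meet" does not match "Meeting"
case_sensitive   <- grepl("meet", subject)
case_insensitive <- grepl("meet", subject, ignore.case = TRUE)

# Patterns match inside longer words: "all" is found in "Call"
inside_word <- grepl("all", subject)

# Word boundaries (\\b) restrict the match to whole words
whole_word <- grepl("\\ball\\b", subject, perl = TRUE)

data.frame(subject, case_sensitive, case_insensitive, inside_word, whole_word)
```

Short patterns such as `all` or `day` can therefore inflate a topic, while lower-case patterns can miss capitalised subjects.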

Because the number of rows that contain an email text is lower than the length of the table, searching for the keywords in the email text creates many NA values. To be able to compute the sum of the emails that contain those words we use the parameter na.rm = TRUE, which removes the NA values before computing the sum (for a sum, this is equivalent to counting them as 0).
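
The effect of `na.rm = TRUE` can be checked on a small vector. `flag` is a hypothetical example of the 0/1 indicators built above, with NA where the email has no text.

```r
# Toy flags (hypothetical): 1 = keyword found, 0 = not found, NA = no email text
flag <- c(1, 0, NA, 1, NA)

sum(flag)                # NA: a single missing value makes the sum missing
sum(flag, na.rm = TRUE)  # 2: missing values are dropped before summing

# Number of emails that could actually be searched
sum(!is.na(flag))        # 3
```

Keeping the number of searchable emails next to the sum helps when comparing the counts from the email text with those from the subject, which has no missing values.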

In the following part we will create plots which represent the email exchange about specific topics. To harmonize the appearance of those plots we declare a color and a label for each category so that they can be applied to every plot.

#the list of category studied and their related color in each plot
topic_colors <- c("sum_subject_business_process" = "steelblue4",
                  "sum_subject_core_business" = "orchid",
                  "sum_subject_meeting" = "chocolate4",
                  "sum_subject_enron_event" = "yellowgreen",
                  "sum_email_business_process" = "cyan3",
                  "sum_email_core_business" = "plum4",
                  "sum_email_meeting" = "salmon",
                  "sum_email_enron_event" = "springgreen4")



#the list of category and their related label on the plot  
topic_label <- c("sum_subject_business_process" = "Business process email subject",
                 "sum_subject_core_business" = "Core Business email subject",
                 "sum_subject_meeting" = "Meeting email subject",
                 "sum_subject_enron_event" = "Enron Event email subject",
                 "sum_email_business_process" = "business process email",
                 "sum_email_core_business" = "core business email",
                 "sum_email_meeting" = "meeting email",
                 "sum_email_enron_event" = "enron event email")
#compute the sum of each topics for each month of each year study
email_subject_send_graph <- email_subject_send %>% 
  group_by(year_month) %>%
  mutate(
    sum_subject_meeting = sum(subject_meeting),
    sum_subject_business_process = sum(subject_business_process),
    sum_subject_core_business = sum(subject_core_business),
    sum_subject_enron_event = sum(subject_enron_event),
    #for the email we use na.rm = TRUE to allow the sum to be done
    sum_email_business_process = sum(email_business_process, na.rm = TRUE),
    sum_email_core_business = sum(email_core_business, na.rm = TRUE),
    sum_email_meeting = sum(email_meeting, na.rm = TRUE),
    sum_email_enron_event = sum(email_enron_event, na.rm = TRUE)) %>% ungroup() %>%
  #drop duplicated rows; subject and reference are kept because they are reused for the word count
  distinct(year_month, subject, reference, sum_subject_meeting, sum_subject_business_process, sum_subject_core_business, sum_subject_enron_event, 
           sum_email_business_process,sum_email_core_business,sum_email_meeting,sum_email_enron_event)



#display the different topic trend in the email subject over the study's period
email_subject_send_graph %>% select(year_month, starts_with("sum_subject_")) %>%
  #change the orientation of the data set
  pivot_longer(
  cols = 2:5,
  names_to = "topics",
  values_to = "value") %>%
  #line plot of the monthly trend of each topic
  ggplot(aes(year_month,value, color=topics))+
  geom_line(linewidth = 1)+
  #label, axis, and legend
  labs(color = "Email subject topics",
    title = "Email topics over the study period",
       x = "year",
       y = "Number of email per topics") +
  #to display the year and month, every 3 months for a better reading
  scale_x_date(date_labels = "%Y-%m", date_breaks = "3 months")+
  scale_color_manual(#to get only the customization for the email categories
    values = topic_colors[1:4],
    labels = topic_label[1:4])

We can see that:

  • The top topic is the meeting one, followed by the business process and the core business.

  • For the meeting topic we have 3 peaks:

    • one between October 2000 and January 2001, maybe to organize the new year and close the past year.

    • one from April to July 2001, which is the period where the heads of the company started to worry about the business process.

    • the highest peak, between October 2001 and January 2002, the period where the fiscal fraud was discovered by the federal agency.

  • For the business process and core business topics we see 2 peaks which follow the last 2 peaks of the meeting topic. This suggests the meetings concern the business. We could think those meetings are more related to the business process than to the core business.

  • The emails about the Enron events are the fewest, but we can see a peak for this topic from October 2001 to around February 2002. This makes sense given the known events, since the company went bankrupt in this period.

For the email subjects, we look at the frequency of each word we searched for.

#the list of words searched for in the subject
word_list <- list("message","origin","pleas","email","thank","attach","file","copi","inform","receiv","thank","time","meet",
                  "look","week","dont","vinc","talk","enron","deal","agreement","chang","contract","corp","fax","houston","america",
                  "risk","analy","confidential","correction", "market","gas","price","power","company","energy","trade","busi","servic","manag",
               "bankrup","SEC","MTM","fear", "investigation", "mark-to-market", "10-K", "losing money", "correction", "phone", "fax", "document")

#initiate a vector for registering their frequency
word_count <- c()

##iterate over the list and count the number of time we see each word in the list
for(i in seq_along(word_list)){
  
  search <- as.character(word_list[[i]])
  nb <- sum(str_count(email_subject_send_graph$subject, search))
  
  word_count <- c(word_count, nb)
  
}

#draw a wordcloud which represent the words frequency

par(bg = "black")
wordcloud(word_list, word_count, min.freq = 10 ,max.words=length(word_list), col=heat.colors(length(word_list), alpha = 0.9), rot.per = 0.3)
title(main = "The top words seen in the email subject", col.main = "white",font.main = 2)

To read the word cloud: the most frequent words are in white and in the largest size; the least frequent words are in red and in the smallest size. This word cloud highlights that:

  • The most frequent word in the list is meet, which is consistent with most email subjects falling into the meeting topic category.

  • Next come many words related to the business process at Enron, such as deal, agreement, change, contract.

  • The smallest words are linked to the Enron events, such as bankruptcy, MTM, SEC. This suggests that the emails exchanged are not explicitly about the Enron events. We may find more of them in the email content.
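One possible reason for low counts on upper-case entries such as SEC or MTM is that `str_count()` matches patterns case-sensitively by default. The following is a minimal sketch on made-up subjects (the `toy_subjects` vector is hypothetical and not taken from the data set) showing how the count changes when case is ignored. Whether this affects the real counts depends on how the subjects were cleaned upstream, which is an assumption here.

```r
library(stringr)

# hypothetical subjects, for illustration only (not taken from the data set)
toy_subjects <- c("SEC filing update", "Re: sec inquiry", "Meeting on MTM", "mtm review")

# case-sensitive count: only the upper-case spelling is found
n_sensitive <- sum(str_count(toy_subjects, "SEC"))

# case-insensitive count with stringr's regex() modifier
n_insensitive <- sum(str_count(toy_subjects, regex("SEC", ignore_case = TRUE)))

c(case_sensitive = n_sensitive, case_insensitive = n_insensitive)
```

Note that a short pattern such as sec, once case is ignored, also matches inside longer words (second, section), so a word boundary such as `"\\bsec\\b"` may be needed.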

#display the different topic trend in the email subject over the study's period
email_subject_send_graph %>% select(year_month, starts_with("sum_email_")) %>%
  #change the orientation of the data set
  pivot_longer(
  cols = 2:5,
  names_to = "email",
  values_to = "value") %>%
  #scatter plot and trend line
  ggplot(aes(year_month,value, color=email))+
  geom_line(size = 1)+
  #label, axis, and legend
  labs(color = "Email content key words",
    title = "Email key word about the Enron event in function of the year",
       x = "year",
       y = "Number of email per key words") +
  #to display the year and month, every 3 months for a better reading
  scale_x_date(date_labels = "%Y-%m", date_breaks = "3 months")+
  scale_color_manual(#to get only the customization for the email categories
    values = topic_colors[5:8],
    labels = topic_label[5:8])

In the email content we can see that:

  • For all topics investigated, we find a peak of emails containing them from April 2001 to April 2002, which matches the peak of email exchange seen earlier in this analysis. In addition, this period covers the SEC investigation, which began in October 2001, and the bankruptcy process in late 2001/early 2002.

  • Most emails contain words about meetings, followed by words related to the business process. Surprisingly, few emails contain words linked to the Enron events. This suggests that the Enron events may have been discussed through other channels such as fax and phone calls.

As for the subject, we can look at the frequency of each word in the email text:

#reduce the dataset to the row which contain email text
df_reference <- filter(email_subject_send_graph, !is.na(reference))

#initiate the liste for storing the count for each words
email_words_freq <- c()

#loop allowing to extract the words in each email text and count the number of type they are found
for(i in seq_along(word_list)){
  
  word <- as.character(word_list[[i]])
  #str_detect returns TRUE for each row where the word is found
  #(counting the non-NA cells of str_locate would count each matching row twice: start and end)
  counting <- str_detect(df_reference$reference, word)
  
  #we count the rows where the word is found
  nb <- sum(counting, na.rm = TRUE)
  
  #store the frequency for each words in the email text
  email_words_freq <- c(email_words_freq, nb)
  
}

#draw the wordcloud with the frequency of each word
par(bg="black")
wordcloud(word_list, email_words_freq, min.freq = 10 ,max.words=length(word_list), col=heat.colors(length(word_list), alpha = 0.9), rot.per = 0.3)
title(main = "The top words seen in the email text", col.main = "white",font.main = 2)

This word cloud is read like the one for the email subject. It shows that:

  • The top words are enron and please, which are related to meetings and to Enron's business process.
  • The next most frequent words are related to meetings (attach, inform, receiv). Then come words linked to the business process, such as contract, chang, confidential. The words fax and phone appear often, suggesting that the emails refer to phone calls or faxes, so people at the time probably communicated a lot through these channels.
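When counting how many emails contain a word, two stringr functions behave differently, which matters for the frequencies shown in a word cloud. The following is a minimal sketch on a made-up vector (`toy_text` is hypothetical): `str_locate()` returns a two-column matrix (start and end), so counting its non-NA cells gives two per matching email, whereas `str_detect()` gives one logical value per email.

```r
library(stringr)

# hypothetical email texts, for illustration only
toy_text <- c("please see the attached contract", "contract signed", "lunch?", NA)

# str_locate() gives a matrix with the columns start and end
loc <- str_locate(toy_text, "contract")
non_na_cells <- sum(!is.na(loc))   # start and end are both counted

# str_detect() gives one TRUE/FALSE/NA per email
n_emails <- sum(str_detect(toy_text, "contract"), na.rm = TRUE)

c(non_na_cells = non_na_cells, n_emails = n_emails)
```

Both counts rank the words in the same order, but only the second one is the number of emails containing the word.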

Then we look at the number of emails received during the study period about those topics.

email_subject_rec <- df_message_status %>% distinct(date, year, month, recipient, status_recipient, subject, reference) %>%
  mutate(#count the number of email which contain at least one word in the list of each topic
    subject_meeting = if_else(str_detect(subject, topic_meeting), 1, 0),
    subject_business_process = if_else(str_detect(subject, topic_business_process), 1, 0),
    subject_core_business = if_else(str_detect(subject, topic_core_business), 1, 0),
    subject_enron_event = if_else(str_detect(subject, topic_enron_event), 1, 0),
    email_meeting = if_else(str_detect(reference,topic_meeting), 1, 0),
    email_business_process = if_else(str_detect(reference, topic_business_process), 1, 0),
    email_core_business = if_else(str_detect(reference, topic_core_business), 1, 0),
    email_enron_event = if_else(str_detect(reference, topic_enron_event), 1, 0),
    #to get the date in year/month
    year_month = as.Date(paste0(year,"-",month,"-01"))) 
#compute the sum of each topics for each month of each year study
email_subject_rec_graph <- email_subject_rec %>% 
  group_by(year_month) %>%
  mutate(
    sum_subject_meeting = sum(subject_meeting),
    sum_subject_business_process = sum(subject_business_process),
    sum_subject_core_business = sum(subject_core_business),
    sum_subject_enron_event = sum(subject_enron_event),
    #for the email we use na.rm = TRUE to allow the sum to be done
    sum_email_business_process = sum(email_business_process, na.rm = TRUE),
    sum_email_core_business = sum(email_core_business, na.rm = TRUE),
    sum_email_meeting = sum(email_meeting, na.rm = TRUE),
    sum_email_enron_event = sum(email_enron_event, na.rm = TRUE)) %>% ungroup() %>%
  #keep one line per year and month
  distinct(year_month, subject, reference, sum_subject_meeting, sum_subject_business_process, sum_subject_core_business, sum_subject_enron_event, 
           sum_email_business_process,sum_email_core_business,sum_email_meeting,sum_email_enron_event)



#display the different topic trend in the email subject over the study's period
email_subject_rec_graph %>% select(year_month, starts_with("sum_subject_")) %>%
  #change the orientation of the data set
  pivot_longer(
  cols = 2:5,
  names_to = "topics",
  values_to = "value") %>%
  #scatter plot and trend line
  ggplot(aes(year_month,value, color=topics))+
  geom_line(size = 1)+
  #label, axis, and legend
  labs(color = "Email subject topics",
    title = "Email received in function of their subject",
       x = "year",
       y = "Number of email") +
  #to display the year and month, every 3 months for a better reading
  scale_x_date(date_labels = "%Y-%m", date_breaks = "3 months")+
  scale_color_manual(#to get only the customization for the email categories
    values = topic_colors[1:4],
    labels = topic_label[1:4])

Here, for the subjects of the emails received, we distinguish two peaks for each topic: the 1st from July 2000 to July 2001 and the 2nd from August 2001 to April 2002. These 2 peaks overlap the 3 peaks seen in the emails sent. For the topics, we see the same pattern as for the emails sent.

#display the different topic trend in the email subject over the study's period
email_subject_rec_graph %>% select(year_month, starts_with("sum_email_")) %>%
  #change the orientation of the data set
  pivot_longer(
  cols = 2:5,
  names_to = "email",
  values_to = "value") %>%
  #scatter plot and trend line
  ggplot(aes(year_month,value, color=email))+
  geom_line(size = 1)+
  #label, axis, and legend
  labs(color = "Email content key words",
    title = "Email received in function of key words in the email content",
       x = "year",
       y = "Number of email") +
  #to display the year and month, every 3 months for a better reading
  scale_x_date(date_labels = "%Y-%m", date_breaks = "3 months")+
  scale_color_manual(#to get only the customization for the email categories
    values = topic_colors[5:8],
    labels = topic_label[5:8])

For the emails received about those topics/keywords, we see a pattern similar to that of the emails sent, suggesting they belong to the same exchanges.

To go deeper into the email content analysis, we next look at the topics and keywords found according to the worker status. For that we create a data frame similar to the previous one, but counting the topics/emails per employee status.

status_email_subject <- email_subject_send %>% 
  #we focus on the worker which their status are know
  filter(!is.na(status_sender)) %>%
  #compute the sum of each topics for each year studied
  group_by(year_month, status_sender) %>%
  mutate(
    sum_subject_meeting = sum(subject_meeting),
    sum_subject_business_process = sum(subject_business_process),
    sum_subject_core_business = sum(subject_core_business),
    sum_subject_enron_event = sum(subject_enron_event),
    #for the email we use na.rm = TRUE to allow the sum to be done
    sum_email_business_process = sum(email_business_process, na.rm = TRUE),
    sum_email_core_business = sum(email_core_business, na.rm = TRUE),
    sum_email_meeting = sum(email_meeting, na.rm = TRUE),
    sum_email_enron_event = sum(email_enron_event, na.rm = TRUE)) %>% ungroup() %>%
  #keep one line per year and month
  distinct(year_month, status_sender, sum_subject_meeting, sum_subject_business_process, sum_subject_core_business, sum_subject_enron_event, 
           sum_email_business_process,sum_email_core_business,sum_email_meeting,sum_email_enron_event)

#pivot the data frame
status_email_subject <- status_email_subject %>%
  pivot_longer(
    cols = 3:length(status_email_subject),
    names_to = "topic_email",
    values_to = "value")
status_list <- c("Employee", "CEO", "Manager", "Director", "Vice President", "Trader", "President", "Managing Director", "In House Lawyer")

#initiate the list to collect the plot
plot_list <- list()

#generating individual plot for each status
for(i in seq(status_list)){
  #assign the status to the variable
  status <- status_list[i]
  
  #the plot related to that status
  p <- status_email_subject %>% filter(status_sender == status) %>%
         ggplot(aes(year_month, value, color = topic_email))+
         geom_line(size = 1)+
         labs(color = "Email key words and topics",
           title = paste("Email send by", status, ", content and subject analyze"),
           y = "Email count",
           x = "date")+
      scale_x_date(date_labels = "%Y-%m", date_breaks = "3 months")+
         scale_color_manual(values = topic_colors,
                    labels = topic_label)+
        theme(legend.text.position = "bottom")
  
  #append the plot list
  plot_list[[i]] <- p
}


#display the plot created
n <- length(plot_list)

#number of plot per layout
plot_per_section <- 3

#create plot layouts
for (i in seq(1, n, by=plot_per_section)){
  
  plot_on_the_page <- plot_list[i:min(i + plot_per_section - 1, n)]
  
  #extract the legend from the first plot on the layout
  legend <- get_legend(plot_on_the_page[[1]], nrow = 2)
  
  #remove the legend for all plot on the layout
  no_legend <- lapply(plot_on_the_page, function(p) p + theme(legend.position = "none"))
  
  #arrange the plots of the layout (3 at most) on 2 columns
  grid_plot <- arrangeGrob(grobs = no_legend, ncol = 2)
  
  #combine together the 3 plot and one legend
  plots_with_legend <- arrangeGrob(
    grid_plot,
    legend,
    nrow = 2,
    #arrange the plot and the legend in the layout
    heights = unit.c(unit(1,"npc") - unit(7, "lines"), unit(7,"lines"))
  )
  
  #display everything together
  grid.newpage()
  grid.draw(plots_with_legend)
  
}

By analyzing the email subject and content according to the status of Enron's workers, we can see that:

  • Every status shows a peak of emails about those topics from April 2001 to January 2002. The top topic for all is meetings, followed by the business process. Moreover, the trend seen for the email text is similar to the one for the email subject.

  • The pattern of the emails sent by the employees follows the topics seen previously for all Enron workers. After emails about meetings, we see a large number of emails about the business process; fewer are about the core business. This could be linked to the investigation, with employees sending emails about the processes they were involved in.

  • For the in-house lawyers we can see 2 peaks in 2001 for the emails about meetings and business process: the 1st from February 2001 to July 2001 and the 2nd from August 2001 to November 2001. These two periods may be linked to the SEC investigation; these emails could be for managing the investigation.

  • For the managing directors, before June 2001 no topic clearly dominates the email content and subject. From then until December 2001 we see a peak of emails about meetings, business process, and core business. Here both business topics seem to be at the same level. We see a similar trend for the managers. At this period the managers probably had many meetings to manage both sides of Enron's business.

  • The traders sent a large number of emails about the core business and process from July 2001 to March 2002. They mention the Enron events only a little.

  • Surprisingly, the CEOs show a large peak of emails related to meetings, core business, and process from December 2000 to May 2001 and then from November 2001 to January 2002. We can distinguish a small peak of emails about the Enron events during these 2 periods, but the count is lower than for the other statuses. This suggests they were not really involved in the email exchange during the SEC investigation, or less so than the other statuses. However, the email text we have may not be exhaustive: the emails about those events may not be public, or the CEOs may have handled most of these events through other channels such as phone calls and fax.

  • For the other statuses at the head of the company (President and Vice President), we see a peak of emails at the end of 2001 and the start of 2002. The highest peaks, after the meeting topic, are linked to both business topics. In addition, we see more emails about the Enron events than for the CEOs. This suggests that they were more involved than the CEOs in the general management of the company as well as in the Enron events.
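The monthly counts per status discussed above can also be sketched with `summarise()`, which returns one row per group directly, as an alternative to `mutate()` followed by `distinct()`. This is only an illustration under stated assumptions: the `toy_emails` table and its values are made up, and only the column names mirror those used in this analysis.

```r
library(dplyr)

# hypothetical flags: one row per email, 1 if the topic was detected
toy_emails <- tibble(
  year_month = as.Date(c("2001-10-01", "2001-10-01", "2001-11-01", "2001-11-01")),
  status_sender = c("CEO", "CEO", "CEO", "Trader"),
  subject_meeting = c(1, 0, 1, 1),
  email_meeting = c(1, NA, 0, 1)
)

# one row per month and status, with the number of flagged emails
toy_counts <- toy_emails %>%
  group_by(year_month, status_sender) %>%
  summarise(
    sum_subject_meeting = sum(subject_meeting),
    # na.rm = TRUE because emails without text have an NA flag
    sum_email_meeting = sum(email_meeting, na.rm = TRUE),
    .groups = "drop"
  )

toy_counts
```

The result has one row per month and status, so no `distinct()` step is needed before pivoting and plotting.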

We do the same for the emails received:

status_email_subject <- email_subject_rec %>%
  #we focus on the worker which their status are know
  filter(!is.na(status_recipient)) %>%
  #compute the sum of each topics for each year studied
  group_by(year_month, status_recipient) %>%
   mutate(
    sum_subject_meeting = sum(subject_meeting),
    sum_subject_business_process = sum(subject_business_process),
    sum_subject_core_business = sum(subject_core_business),
    sum_subject_enron_event = sum(subject_enron_event),
    #for the email we use na.rm = TRUE to allow the sum to be done
    sum_email_business_process = sum(email_business_process, na.rm = TRUE),
    sum_email_core_business = sum(email_core_business, na.rm = TRUE),
    sum_email_meeting = sum(email_meeting, na.rm = TRUE),
    sum_email_enron_event = sum(email_enron_event, na.rm = TRUE)) %>% ungroup() %>%
  #keep one line per year and month
  distinct(year_month, status_recipient, sum_subject_meeting, sum_subject_business_process, sum_subject_core_business, sum_subject_enron_event, 
           sum_email_business_process,sum_email_core_business,sum_email_meeting,sum_email_enron_event)
#pivot the data frame
status_email_subject <- status_email_subject %>%
  pivot_longer(
    cols = 3:length(status_email_subject),
    names_to = "topic_email",
    values_to = "value")
status_list <- c("Employee", "CEO", "Manager", "Director", "Vice President", "Trader", "President", "Managing Director", "In House Lawyer")

#initiate the list to collect the plot
plot_list <- list()

#generating individual plot for each status
for(i in seq(status_list)){
  #assign the status to the variable
  status <- status_list[i]
  
  #the plot related to that status
  p <- status_email_subject %>% filter(status_recipient == status) %>%
         ggplot(aes(year_month,value, color = topic_email))+
         geom_line(size = 1) +
      scale_x_date(date_labels = "%Y-%m", date_breaks = "3 months")+    
         labs(color = "Email key words and topics",
           title = paste("Email received by", status, ", content and subject analyze"),
           y = "Email count")+
         scale_color_manual(values = topic_colors,
                    labels = topic_label)+
        theme(legend.text.position = "bottom")
  
  #append the plot list
  plot_list[[i]] <- p
}


#display the plot created
n <- length(plot_list)

#number of plot per layout
plot_per_section <- 3

#create plot layouts
for (i in seq(1, n, by=plot_per_section)){
  
  plot_on_the_page <- plot_list[i:min(i + plot_per_section - 1, n)]
  
  #extract the legend from the first plot on the layout
  legend <- get_legend(plot_on_the_page[[1]], nrow = 2)
  
  #remove the legend for all plot on the layout
  no_legend <- lapply(plot_on_the_page, function(p) p + theme(legend.position = "none"))
  
  #arrange the plots of the layout (3 at most) on 2 columns
  grid_plot <- arrangeGrob(grobs = no_legend, ncol = 2)
  
  #combine together the 3 plot and one legend
  plots_with_legend <- arrangeGrob(
    grid_plot,
    legend,
    nrow = 2,
    #arrange the plot and the legend in the layout
    heights = unit.c(unit(1,"npc") - unit(7, "lines"), unit(7,"lines"))
  )
  
  #display everything together
  grid.newpage()
  grid.draw(plots_with_legend)
  
}

When we look at the emails received, we can see that:

  • The pattern for the emails received looks the same as for the emails sent, suggesting most are exchanges about the same subjects. In the emails received, every status shows more emails about the Enron events, suggesting that people in the company were aware of what was happening. These emails may be information about the events or directions to follow when answering possible questions from the investigators.

  • The CEOs received more emails than they sent. In particular, they received a large number of emails about meetings; perhaps, because of their position, they were informed of all or most of the meetings held in the company. During the Enron events they seem to have received a large number of emails about the core business and processes, perhaps to keep them informed about what was happening in the company.

This analysis of the email text and subject highlights that the different statuses were informed about what was happening in the company, from the processes used for the business to the management of the investigation and the bankruptcy. The heads of the company seem to have been informed rather than active in the email exchange about managing the Enron events. It seems that both business parts of the company may have been managed more by the President and Vice Presidents than by the CEO. The in-house lawyers were more active in the email exchange during the SEC investigation and the bankruptcy, perhaps from a legal point of view.

As for all the workers in the company, we now look, per status, at which of the searched topic words are the most frequent in the email subject or text. Here, we focus on the top 10 words found in subject and text combined.

status_list <- c("Employee", "CEO", "Manager", "Director", "Vice President", "Trader", "President", "Managing Director", "In House Lawyer")

par(bg = "black")

for(i in seq_along(status_list)){
  
  status <- status_list[i]

df <- email_subject_send %>%
  #we focus on the worker which their status are know
  filter(status_sender == status) %>%
  #compute the sum of each topics for each year studied
  group_by(year_month, status_sender) %>%
  mutate(
    sum_subject_meeting = sum(subject_meeting),
    sum_subject_business_process = sum(subject_business_process),
    sum_subject_core_business = sum(subject_core_business),
    sum_subject_enron_event = sum(subject_enron_event),
    #for the email we use na.rm = TRUE to allow the sum to be done
    sum_email_business_process = sum(email_business_process, na.rm = TRUE),
    sum_email_core_business = sum(email_core_business, na.rm = TRUE),
    sum_email_meeting = sum(email_meeting, na.rm = TRUE),
    sum_email_enron_event = sum(email_enron_event, na.rm = TRUE)) %>% ungroup() %>%
  filter((sum_subject_meeting != 0) | (sum_subject_business_process != 0) | (sum_subject_core_business != 0) | (sum_subject_enron_event != 0) | (sum_email_business_process != 0) | (sum_email_core_business != 0) | (sum_email_meeting != 0) | (sum_email_enron_event != 0)) %>%
  #keep one line per year and month
  distinct(status_sender, subject, reference)

#initiate the liste for storing the count for each words in text and subject
email_words_freq <- c()
subject_freq <- c()

#loop allowing to extract the words in each email text and count the number of type they are found
for(j in seq_along(word_list)){
  
  word <- as.character(word_list[[j]])
  #count for the subject
  counting_subject <- sum(str_count(df$subject, word))
  
  subject_freq <- c(subject_freq, counting_subject)
  
  #str_detect returns TRUE for each row where the word is found
  counting_text <- str_detect(df$reference, word)
  
  #we count the rows where the word is found
  nb <- sum(counting_text, na.rm = TRUE)
  
  #store the frequency for each words in the email text
  email_words_freq <- c(email_words_freq, nb)
  
}

#for each status we make a total with the count from the subject and the text
total_count <- subject_freq + email_words_freq

#draw the wordcloud with the frequency of each word, only the top 10
wordcloud(word_list, total_count, min.freq = 10 ,max.words= 10,scale = c(3, 0.5) ,col=heat.colors(length(total_count), alpha = 0.9), rot.per = 0.3)
title(main = paste0("Top 10 words in the email send by ",status), col.main = "white", font.main = 2)

}

This last analysis of the emails sent highlights that:

  • For all statuses the top words are related to the meeting topic.

  • The employees and traders also mention contract, which we associate with the business process, perhaps because they are involved in this step of Enron's business.

  • The CEOs are the only status whose email subjects and text contain more words related to business than to meetings. This suggests they sent more emails about business than emails organizing meetings.

status_list <- c("Employee", "CEO", "Manager", "Director", "Vice President", "Trader", "President", "Managing Director", "In House Lawyer")

par(bg = "black")

for(i in seq_along(status_list)){
  
  status <- status_list[i]

df <- email_subject_rec %>%
  #we focus on the worker which their status are know
  filter(status_recipient == status) %>%
  #compute the sum of each topics for each year studied
  group_by(year_month, status_recipient) %>%
  mutate(
    sum_subject_meeting = sum(subject_meeting),
    sum_subject_business_process = sum(subject_business_process),
    sum_subject_core_business = sum(subject_core_business),
    sum_subject_enron_event = sum(subject_enron_event),
    #for the email we use na.rm = TRUE to allow the sum to be done
    sum_email_business_process = sum(email_business_process, na.rm = TRUE),
    sum_email_core_business = sum(email_core_business, na.rm = TRUE),
    sum_email_meeting = sum(email_meeting, na.rm = TRUE),
    sum_email_enron_event = sum(email_enron_event, na.rm = TRUE)) %>% ungroup() %>%
  filter((sum_subject_meeting != 0) | (sum_subject_business_process != 0) | (sum_subject_core_business != 0) | (sum_subject_enron_event != 0) | (sum_email_business_process != 0) | (sum_email_core_business != 0) | (sum_email_meeting != 0) | (sum_email_enron_event != 0)) %>%
  #keep one line per year and month
  distinct(status_recipient, subject, reference)

#initiate the list for storing the count for each words in text and subject
email_words_freq <- c()
subject_freq <- c()

#loop allowing to extract the words in each email text and count the number of type they are found
for(j in seq_along(word_list)){
  
  word <- as.character(word_list[[j]])
  #count for the subject
  counting_subject <- sum(str_count(df$subject, word))
  
  subject_freq <- c(subject_freq, counting_subject)
  
  #str_detect returns TRUE for each row where the word is found
  counting_text <- str_detect(df$reference, word)
  
  #we count the rows where the word is found
  nb <- sum(counting_text, na.rm = TRUE)
  
  #store the frequency for each words in the email text
  email_words_freq <- c(email_words_freq, nb)
  
}

#for each status we make a total with the count from the subject and the text
total_count <- subject_freq + email_words_freq

#draw the wordcloud with the frequency of each word, only the top 10
wordcloud(word_list, total_count, min.freq = 10 ,max.words= 10,scale = c(3, 0.5) ,col=heat.colors(length(total_count), alpha = 0.9), rot.per = 0.3)
title(main = paste0("Top 10 words in the email received by ",status), col.main = "white", font.main = 2)

}

For the emails received, the top 10 words for every status fall into the same topic categories as for the emails sent. However, for the CEOs the top 10 contains more words about meetings than about business. This suggests the CEOs were informed about the content of the meetings, for example through reports, while they seem to have given directions for the business process and core business. This would be consistent with their position.

This last analysis highlights that, among the emails whose subject and/or text contain the words we searched for in each topic, the top words are related to meetings. This makes sense given the periods in which the peak of those topics occurs for each status. These email exchanges may relate to meetings for managing the Enron events as well as the business side of the company.
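To read the exact ranking behind a word cloud, the counts can be sorted into a named vector. The following is a minimal sketch with made-up words and counts (`toy_words` and `toy_counts` are hypothetical). Duplicated entries, such as a word listed twice in a search list, are dropped first so that each word is ranked once.

```r
# hypothetical words and counts, for illustration only
toy_words <- c("meet", "deal", "fax", "thank", "fax", "thank")
toy_counts <- c(120, 45, 30, 80, 30, 80)

# keep the first occurrence of each word
keep <- !duplicated(toy_words)

# named vector of counts, sorted from most to least frequent, top 3 only
top_words <- head(sort(setNames(toy_counts[keep], toy_words[keep]), decreasing = TRUE), 3)

top_words
```

The same idea applies to a top 10 by replacing 3 with 10 in `head()`.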

#global environment cleaning
rm(grid_plot, i, j, n, no_legend, p, p3, p4, plot_list, plot_on_the_page, plot_per_section, plots_with_legend, status, status_list,
   status_email_subject, adjacencyData_99, adjacencyData_00, adjacencyData_01, adjacencyData_02, word, word_count, nb, legend, email_words_freq, counting, search, df, total_count, email_words_freq, subject_freq)

The Wikipedia page on the Enron scandal lists people involved in it. We search for them in the data set to analyze the subjects of the emails they sent and to see whether they played a role in the Enron scandal. Source: Wikipedia page about the Enron downfall timeline.

We find:

  • Kenneth Lay: he was the founder, chief executive officer, and chairman of Enron and was heavily involved in the scandal.

  • Jeffrey Skilling: he was the CEO of the company during the scandal and was deeply involved in the fraud.

  • Andrew Fastow: he was the chief financial officer and was fired shortly before the bankruptcy.

  • Lea Fastow: she was the assistant treasurer of Enron and the wife of Andrew Fastow.

  • Timothy Belden: he was the head of trading at Enron.

  • Vincent Kaminski: he worked at Enron as the head of the quantitative modelling group.

  • Jordan Mintz: he is a former managing director for corporate tax at Enron.

  • Sherron Watkins: she was one of the vice presidents of Enron.

  • Richard Causey: he was the chief accounting officer of Enron.

  • Greg Whalley: he was an Enron executive.

To this list we add Jeff Dasovich, who does not appear on the Wikipedia page but whom we found to be the most active employee in sending emails. He may have taken part in some exchanges related to the Enron events.

#to find the person involved in the fiscal fraud we use str_detect to see if we can find them in the data set
#for example here for Vincent Kaminski
people_of_interest <- df_message_status%>% filter(str_detect(sender,"kaminski"))

First we construct the data sets of the emails sent and received by each Enron worker known to have been involved in the fraud.

#email send:
person_of_interest_send <- email_subject_send %>%
  filter(str_detect(sender,"jeff.dasovich|andrew.baker|tim.belden|andrew.fastow|lfastow|vkaminski|jordan.mintz|jeff.skilling|sherron.watkins|richard.causey|greg.whalley")) %>%
  mutate(
    #identify the person who sent the email
    email_label_sender = case_when(
      sender == "jeff.dasovich@enron.com" ~ "Jeff Dasovich",
      sender == "kenneth.lay@enron.com" ~ "Kenneth Lay",
      sender == "jeff.skilling@enron.com" ~ "Jeffrey Skilling",
      sender == "andrew.baker@enron.com" ~ "Andrew Baker",
      sender == "tim.belden@enron.com" ~ "Timothy Belden", 
      sender %in% c("lfastow@pop.pdq.net", "lfastow@pdq.net") ~ "Lea Fastow",
      sender == "andrew.fastow@enron.com" ~ "Andrew Fastow",
      sender %in% c("vkaminski@enron.com", "vkaminski@aol.com", "vkaminski@palm.net") ~ "Vincent Kaminski",
      sender == "jordan.mintz@enron.com" ~ "Jordan Mintz",
      sender == "sherron.watkins@enron.com" ~ "Sherron Watkins",
      sender == "richard.causey@enron.com" ~ "Richard Causey", #chief account officer wikipedia source
      sender == "greg.whalley@enron.com" ~ "Greg Whalley", #president and COO of Enron wholesale service
      .default = sender))

#email received
person_of_interest_reciveid <- email_subject_rec %>%
  filter(str_detect(recipient,"jeff.dasovich|andrew.baker|tim.belden|andrew.fastow|lfastow|vkaminski|jordan.mintz|jeff.skilling|sherron.watkins|richard.causey|greg.whalley")) %>%
  mutate(
    #identify the person who received the email
    email_label_recipient = 
      case_when(
        recipient %in% c("jeff.dasovich@enron.com","jeff_dasovich@ees.enron.com") ~ "Jeff Dasovich",
        recipient == "kenneth.lay@enron.com" ~ "Kenneth Lay",
        recipient %in% c("jeff.skilling@enron.com","jeff_skilling@enron.com") ~ "Jeffrey Skilling",
        recipient == "andrew.baker@enron.com" ~ "Andrew Baker",
        recipient %in% c("tim.belden@enron.com", "tim_belden@pgn.com") ~ "Timothy Belden",
        recipient %in% c("lfastow@pop.pdq.net", "lfastow@pdq.net") ~ "Lea Fastow",
        recipient %in% c("andrew.fastow@enron.com", "andrew.fastow@ljminvestments.com") ~ "Andrew Fastow",
        recipient %in% c("vkaminski@enron.com", "vkaminski@aol.com", "vkaminski@aol .com",
                         "vkaminski@palm.net") ~ "Vincent Kaminski",
        recipient %in% c("jordan.mintz@enron.com","jordan_mintz@enron.com") ~ "Jordan Mintz",
        recipient == "sherron.watkins@enron.com" ~ "Sherron Watkins",
        recipient == "richard.causey@enron.com" ~ "Richard Causey", #chief accounting officer (source: Wikipedia)
        recipient == "greg.whalley@enron.com" ~ "Greg Whalley", #president and COO of Enron Wholesale Services
        .default = recipient)) 
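
The two case_when() blocks above repeat the same address-to-name mapping. As a minimal sketch, and only as one possible alternative, the mapping could live in a lookup table that is joined to the emails. The table poi_lookup, the toy tibble toy_rec and the address someone@enron.com are hypothetical names introduced for illustration; only a few of the addresses listed above are shown.

```r
library(dplyr)
library(tibble)

# Hypothetical lookup table: one row per known address (subset of the addresses above)
poi_lookup <- tribble(
  ~address,                  ~label,
  "andrew.fastow@enron.com", "Andrew Fastow",
  "lfastow@pdq.net",         "Lea Fastow",
  "vkaminski@aol.com",       "Vincent Kaminski"
)

# Toy stand-in for the recipient column of email_subject_rec
toy_rec <- tibble(recipient = c("vkaminski@aol.com", "lfastow@pdq.net", "someone@enron.com"))

labelled <- toy_rec %>%
  left_join(poi_lookup, by = c("recipient" = "address")) %>%
  # fall back to the raw address, like .default in case_when()
  mutate(email_label_recipient = coalesce(label, recipient)) %>%
  select(-label)

labelled
```

With this layout, the same table could be reused for the sender column, so an address added once would apply to both labels.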

We look at the number of emails sent and received by each person studied. The emails sent:

enron_worker_send <- c("Jeff Dasovich","Jeffrey Skilling", "Timothy Belden","Lea Fastow","Andrew Fastow",
                  "Vincent Kaminski","Jordan Mintz","Richard Causey", "Greg Whalley") 
  
#loop building, for each person, a bar plot of the number of emails sent per month
worker_send_plot <- list()

for (i in seq_along(enron_worker_send)) {

  worker <- enron_worker_send[i]

  p <- person_of_interest_send %>%
    filter(email_label_sender == worker) %>%
    count(year, month) %>%
    #bar plot, one panel per year
    ggplot(aes(month, n, fill = month)) +
    geom_col() +
    facet_grid(~year) +
    labs(title = paste("Emails sent per month for each year by", worker),
         y = "Email count per month") +
    scale_fill_manual(
      values = month_color,
      labels = month_label) +
    theme(legend.position = "bottom",
          axis.text.x = element_blank(),
          axis.ticks.x = element_blank(),
          axis.title.x = element_blank())

  worker_send_plot[[i]] <- p
}

worker_send_plot
[Output: nine bar plots, one per person in enron_worker_send. Each plot shows the email count per month, with one panel per year.]

The emails received:

enron_worker_rec <- c("Jeff Dasovich", "Jeffrey Skilling", "Timothy Belden","Lea Fastow","Andrew Fastow",
                  "Vincent Kaminski","Jordan Mintz","Sherron Watkins","Richard Causey", "Greg Whalley")

#loop building, for each person, a bar plot of the number of emails received per month
worker_rec_plot <- list()

for (i in seq_along(enron_worker_rec)) {

  worker <- enron_worker_rec[i]

  p <- person_of_interest_reciveid %>%
    filter(email_label_recipient == worker) %>%
    count(year, month) %>%
    #bar plot, one panel per year
    ggplot(aes(month, n, fill = month)) +
    geom_col() +
    facet_grid(~year) +
    labs(title = paste("Emails received per month for each year by", worker),
         y = "Email count per month") +
    scale_fill_manual(
      values = month_color,
      labels = month_label) +
    theme(legend.position = "bottom",
          axis.text.x = element_blank(),
          axis.ticks.x = element_blank(),
          axis.title.x = element_blank())

  worker_rec_plot[[i]] <- p
}

worker_rec_plot
[Output: ten bar plots, one per person in enron_worker_rec. Each plot shows the email count per month, with one panel per year.]

When we look at the number of emails sent and received by the Enron workers known to be involved in the Enron events, we can see that they send fewer emails than they receive. Moreover, each of them follows the general pattern of the workers in the company. Jeff Dasovich, whom we identified earlier as potentially the most active employee in the email exchanges, is also one of the most active people in this group.
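
The comparison between emails sent and received can also be made numerically. The sketch below is only an illustration on toy data: toy_send and toy_rec are hypothetical stand-ins for person_of_interest_send and person_of_interest_reciveid, and the counts are invented.

```r
library(dplyr)
library(tibble)

# Toy stand-ins for the two labelled tables (invented rows, one row per email)
toy_send <- tibble(email_label_sender = c("Jeff Dasovich", "Jeff Dasovich", "Andrew Fastow"))
toy_rec  <- tibble(email_label_recipient = c("Jeff Dasovich", "Andrew Fastow",
                                             "Andrew Fastow", "Andrew Fastow"))

# Count the emails per person on each side
sent     <- count(toy_send, person = email_label_sender, name = "n_sent")
received <- count(toy_rec, person = email_label_recipient, name = "n_received")

# Put both counts side by side; a person missing on one side gets 0
balance <- full_join(sent, received, by = "person") %>%
  mutate(across(c(n_sent, n_received), ~ coalesce(.x, 0L)),
         ratio_sent_received = n_sent / n_received)

balance
```

A ratio below 1 means that the person receives more emails than they send.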

Then we look at the number of emails sent about the topics and keywords we have identified.

#for each person of interest and each month, count the emails sent per topic
person_of_interest_send_subject <- person_of_interest_send %>%
  group_by(year_month, email_label_sender) %>%
  #one line per month and sender
  summarise(
    sum_subject_meeting = sum(subject_meeting),
    sum_subject_business_process = sum(subject_business_process),
    sum_subject_core_business = sum(subject_core_business),
    sum_subject_enron_event = sum(subject_enron_event),
    #for the email content we use na.rm = TRUE so that missing values do not break the sum
    sum_email_business_process = sum(email_business_process, na.rm = TRUE),
    sum_email_core_business = sum(email_core_business, na.rm = TRUE),
    sum_email_meeting = sum(email_meeting, na.rm = TRUE),
    sum_email_enron_event = sum(email_enron_event, na.rm = TRUE),
    .groups = "drop")


#pivot the table
person_of_interest_send_subject <-person_of_interest_send_subject %>%
  pivot_longer(
  cols = 3:length(person_of_interest_send_subject),
  names_to = "topic_email",
  values_to = "value"
)
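
The pivot above selects the columns by position (from the third column to the last). As a hedged alternative sketch, the same reshaping can select the columns by name with starts_with("sum_"), which keeps working if the column order changes. The tibble toy_wide and its values are invented for illustration.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy wide table: one row per month and sender, one column per topic count
toy_wide <- tibble(
  year_month = as.Date(c("2001-10-01", "2001-11-01")),
  email_label_sender = c("Jeff Dasovich", "Jeff Dasovich"),
  sum_subject_meeting = c(4, 2),
  sum_email_meeting = c(7, 5)
)

# Long format: one row per month, sender and topic
toy_long <- toy_wide %>%
  pivot_longer(
    cols = starts_with("sum_"),
    names_to = "topic_email",
    values_to = "value"
  )

toy_long
```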

For each Enron worker known to be involved in the different Enron events, we create a line plot of the number of emails per topic to follow how the topics discussed evolve over the study period.

#initiate the list to collect the plots
plot_list <- list()

#generate an individual plot for each person
for (i in seq_along(enron_worker_send)) {
  #assign the person to the variable
  worker <- enron_worker_send[i]

  #the plot related to that person
  p <- person_of_interest_send_subject %>%
    filter(email_label_sender == worker) %>%
    ggplot(aes(year_month, value, color = topic_email)) +
    geom_line(linewidth = 1) +
    labs(color = "Email topics",
         title = paste("Email topics sent by", worker),
         y = "Email count per subject topic") +
    scale_x_date(date_labels = "%Y-%m", date_breaks = "months") +
    scale_color_manual(values = topic_colors,
                       labels = topic_label) +
    theme(legend.text.position = "bottom")

  #append the plot to the list
  plot_list[[i]] <- p
}


#display the plots created
n <- length(plot_list)

#number of plots per layout
plot_per_section <- 3

#create the plot layouts
for (i in seq(1, n, by = plot_per_section)) {

  plot_on_the_page <- plot_list[i:min(i + plot_per_section - 1, n)]

  #extract the legend from the first plot on the layout
  legend <- get_legend(plot_on_the_page[[1]], nrow = 2)

  #remove the legend from every plot on the layout
  no_legend <- lapply(plot_on_the_page, function(p) p + theme(legend.position = "none"))

  #arrange the (up to 3) plots of the layout in 2 columns
  grid_plot <- arrangeGrob(grobs = no_legend, ncol = 2)

  #stack the plot grid and the shared legend
  plots_with_legend <- arrangeGrob(
    grid_plot,
    legend,
    nrow = 2,
    #the legend takes 7 lines, the plots take the remaining height
    heights = unit.c(unit(1, "npc") - unit(7, "lines"), unit(7, "lines"))
  )

  #display everything together
  grid.newpage()
  grid.draw(plots_with_legend)

}
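
The paging logic of the loop above can be checked on its own, without any plot. The helper page_indices() below is a hypothetical name introduced for this sketch; it returns, for each layout, the positions of the plots that would be drawn together.

```r
# Hypothetical helper: split the positions 1..n into pages of at most per_page items
page_indices <- function(n, per_page) {
  starts <- seq(1, n, by = per_page)
  lapply(starts, function(i) i:min(i + per_page - 1, n))
}

# 9 plots (senders) fill three full pages
pages_send <- page_indices(9, 3)

# 10 plots (recipients) need a fourth page holding the last plot only
pages_rec <- page_indices(10, 3)

pages_rec
```

With 10 plots the last layout contains a single plot, which explains why the final page of the recipients looks different from the others.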

We can see that:

  • Jeff Dasovich is by far the most active sender in this short list. He sends emails about all topics, especially meetings and both business aspects (core business and business process). He may be one of the employees involved in the different events or managing them, possibly with a high level of responsibility.

  • The other workers identified as involved in the Enron events send few emails about those topics (no more than 15). This may be because the email text data are not exhaustive: many of the emails they sent on these topics may have been withheld from the public data set.

  • All of them send emails about meetings, the business process, and the Enron events. Surprisingly, we find no words associated with the core business of Enron. These people may be more active in the business process than in the regular affairs of the company.

Next, we look at the number of emails received about those topics.

#for each person of interest and each month, count the emails received per topic
person_of_interest_reciveid_subject <- person_of_interest_reciveid %>%
  group_by(year_month, email_label_recipient) %>%
  #one line per month and recipient
  summarise(
    sum_subject_meeting = sum(subject_meeting),
    sum_subject_business_process = sum(subject_business_process),
    sum_subject_core_business = sum(subject_core_business),
    sum_subject_enron_event = sum(subject_enron_event),
    #for the email content we use na.rm = TRUE so that missing values do not break the sum
    sum_email_business_process = sum(email_business_process, na.rm = TRUE),
    sum_email_core_business = sum(email_core_business, na.rm = TRUE),
    sum_email_meeting = sum(email_meeting, na.rm = TRUE),
    sum_email_enron_event = sum(email_enron_event, na.rm = TRUE),
    .groups = "drop")

#pivot the table
person_of_interest_reciveid_subject <-person_of_interest_reciveid_subject %>%
  pivot_longer(
  cols = 3:length(person_of_interest_reciveid_subject),
  names_to = "topic_email",
  values_to = "value"
)

We display the emails received about those topics for each Enron worker known to be implicated in the Enron events.

#initiate the list to collect the plots
plot_list <- list()

#generate an individual plot for each person
for (i in seq_along(enron_worker_rec)) {
  #assign the person to the variable
  worker <- enron_worker_rec[i]

  #the plot related to that person
  p <- person_of_interest_reciveid_subject %>%
    filter(email_label_recipient == worker) %>%
    ggplot(aes(year_month, value, color = topic_email)) +
    geom_line(linewidth = 1) +
    scale_x_date(date_labels = "%Y-%m", date_breaks = "3 months") +
    labs(color = "Email content keywords and topics",
         title = paste("Emails received about Enron's events and functions by", worker),
         y = "Email count per search category") +
    scale_color_manual(values = topic_colors,
                       labels = topic_label) +
    theme(legend.text.position = "bottom")

  #append the plot to the list
  plot_list[[i]] <- p
}


#display the plots created
n <- length(plot_list)

#number of plots per layout
plot_per_section <- 3

#create the plot layouts
for (i in seq(1, n, by = plot_per_section)) {

  plot_on_the_page <- plot_list[i:min(i + plot_per_section - 1, n)]

  #extract the legend from the first plot on the layout
  legend <- get_legend(plot_on_the_page[[1]], nrow = 2)

  #remove the legend from every plot on the layout
  no_legend <- lapply(plot_on_the_page, function(p) p + theme(legend.position = "none"))

  #arrange the (up to 3) plots of the layout in 2 columns
  grid_plot <- arrangeGrob(grobs = no_legend, ncol = 2)

  #stack the plot grid and the shared legend
  plots_with_legend <- arrangeGrob(
    grid_plot,
    legend,
    nrow = 2,
    #the legend takes 7 lines, the plots take the remaining height
    heights = unit.c(unit(1, "npc") - unit(7, "lines"), unit(7, "lines"))
  )

  #display everything together
  grid.newpage()
  grid.draw(plots_with_legend)

}

We can observe that:

  • Everyone receives more emails about the Enron events and both business aspects than they send, which shows that they are more informed than active in the email exchanges about those topics. The exception is Jeff Dasovich, who receives and sends a similar number of emails related to those topics.

  • After the meeting topic, Timothy Belden and Vincent Kaminski receive more emails about the business process than about the other topics. This may be due to their role in the company and suggests that they are the best informed in this group about the business process.

From this analysis we can deduce that Jeff Dasovich is highly active in the email exchanges on all the topics investigated here. The other people whose email subjects and content we looked at seem to be more passive than active: they send few emails about those topics compared to the number they receive. Among the emails received, an important part concerns the business process as well as meetings. This suggests that those people are aware of how the company manages its business and may take part in meetings about it.

The external exchange

When we started to explore the data set, we pointed out that in about 1% of the email exchanges, on average, neither the sender nor the recipient has an Enron email address. Those people are potentially external to the company and could discuss the events. We can imagine that external people involved in internal email exchanges could tell other external people what the Enron workers were doing inside the company. In this part we explore this hypothesis.
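
Deciding whether an address is external depends on how an Enron address is recognised. The sketch below compares two patterns on three addresses that appear in this analysis. It suggests that a plain "@enron" match may treat subdomain addresses such as jeff_dasovich@ees.enron.com as external. The stricter pattern is an assumption introduced here, not the rule used in the analysis.

```r
library(stringr)

addresses <- c("jeff.dasovich@enron.com",
               "jeff_dasovich@ees.enron.com",
               "berk@haas.berkeley.edu")

# Plain match on "@enron": misses addresses hosted on a subdomain
simple_match <- str_detect(addresses, "@enron")

# Assumed stricter rule: any address ending in enron.com, with optional subdomains
domain_match <- str_detect(
  addresses,
  regex("@([a-z0-9-]+\\.)*enron\\.com$", ignore_case = TRUE)
)

data.frame(addresses, simple_match, domain_match)
```

If subdomain addresses are frequent in the data, the share of external exchanges could be somewhat lower than the figure obtained with the plain match.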

#extract the email exchanges that involve no Enron email address
extern_email <- df_message_status %>%
  select(date, year, month, sender, recipient, subject, reference) %>%
  #flag the senders and recipients that have an Enron email address
  mutate(count_sender = if_else(str_detect(sender, "@enron"), 1, 0),
         count_recipient = if_else(str_detect(recipient, "@enron"), 1, 0)) %>%
  #for each date and subject, count the senders and recipients with an Enron email address
  group_by(date, subject) %>%
  mutate(sum_sender = sum(count_sender),
         sum_recipient = sum(count_recipient)) %>%
  ungroup() %>%
  #keep the exchanges with no Enron address on either side
  filter((sum_sender == 0) & (sum_recipient == 0)) %>%
  select(-c(count_sender, count_recipient, sum_sender, sum_recipient)) %>%
  #convert the sender and recipient columns to factors
  transform(sender = as.factor(sender),
            recipient = as.factor(recipient))
summary(extern_email)
##       date              year           month     
##  Min.   :1999-09-19   1999:  870   10     :4347  
##  1st Qu.:2000-12-03   2000: 6879   11     :4209  
##  Median :2001-05-25   2001:15653   12     :3818  
##  Mean   :2001-05-10   2002: 1810   09     :2620  
##  3rd Qu.:2001-10-26                05     :1811  
##  Max.   :2002-12-21                04     :1696  
##                                    (Other):6711  
##                                 sender     
##  owner-eveningmba@haas.berkeley.edu:  910  
##  naftcorp@aol.com                  :  897  
##  jbennett@gmssr.com                :  889  
##  berk@haas.berkeley.edu            :  871  
##  duggar@haas.berkeley.edu          :  761  
##  feedback@intcx.com                :  611  
##  (Other)                           :20273  
##                         recipient    
##  Undisclosed-Recipient       :  838  
##  eveningmba@haas.berkeley.edu:  431  
##  soblander@carrfut.com       :  372  
##  tie_list_server@nyiso.com   :  283  
##  marketing@nymex.com         :  275  
##  linguaphile@wordsmith.org   :  265  
##  (Other)                     :22748  
##                                                                                             subject     
##  Quantitative Finance Update from FinMath.com @ Chicago                                         :  897  
##  NYS Reliability Council Executive Committee                                                    :  515  
##  Brief of Enron Energy Service Inc. on Rate Design -- A. 00-11-038                              :  445  
##  looking for key players to form a founding team of startup                                     :  298  
##  Comments of Enron Energy Services on Proposed and Alternate Decis\tions -- A. 00-11-038, et al.:  230  
##  Errata To the Rate Design Testimony of Enron Energy Services Inc.                              :  214  
##  (Other)                                                                                        :22613  
##   reference        
##  Length:25212      
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

Looking at the data summary, we can see that:

  • those emails were mostly sent in 2001: the median date is 2001-05-25 and the third quartile is 2001-10-26.

  • the most frequent sender address belongs to the haas.berkeley.edu domain (University of California, Berkeley). For the recipients, the top entry is "Undisclosed-Recipient", so we do not know the address of the top recipient.

  • among the top subjects, three mention Enron.
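
The weight of 2001 can be quantified from the year counts printed in the summary above. The short sketch below only reuses those four counts.

```r
# Year counts copied from the summary output of extern_email
year_counts <- c("1999" = 870, "2000" = 6879, "2001" = 15653, "2002" = 1810)

# Total number of rows; it matches the length reported for the reference column
total <- sum(year_counts)

# Share of each year, in percent
share <- round(100 * year_counts / total, 1)

total
share
```

About 62% of these external exchanges are dated 2001.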

This suggests that we could investigate these email exchanges further to see whether they discuss the Enron events. For that we use the same topics and keywords as in the main table.

extern_email_graph <- extern_email %>%
  distinct(date, year, month, sender, recipient, subject, reference) %>%
  #keep the emails mentioning Enron in their subject or content
  filter(str_detect(subject, "enron|Enron") | str_detect(reference, "enron|Enron")) %>%
  mutate(
    #flag the emails whose subject or content contains at least one word of each topic list
    subject_meeting = if_else(str_detect(subject, topic_meeting), 1, 0),
    subject_business_process = if_else(str_detect(subject, topic_business_process), 1, 0),
    subject_core_business = if_else(str_detect(subject, topic_core_business), 1, 0),
    subject_enron_event = if_else(str_detect(subject, topic_enron_event), 1, 0),
    email_meeting = if_else(str_detect(reference, topic_meeting), 1, 0),
    email_business_process = if_else(str_detect(reference, topic_business_process), 1, 0),
    email_core_business = if_else(str_detect(reference, topic_core_business), 1, 0),
    email_enron_event = if_else(str_detect(reference, topic_enron_event), 1, 0),
    #first day of the month, to get the date as year/month
    year_month = as.Date(paste0(year, "-", month, "-01"))) %>%
  group_by(year_month) %>%
  mutate(
    sum_subject_meeting = sum(subject_meeting),
    sum_subject_business_process = sum(subject_business_process),
    sum_subject_core_business = sum(subject_core_business),
    sum_subject_enron_event = sum(subject_enron_event),
    #for the email content we use na.rm = TRUE so that missing values do not break the sum
    sum_email_business_process = sum(email_business_process, na.rm = TRUE),
    sum_email_core_business = sum(email_core_business, na.rm = TRUE),
    sum_email_meeting = sum(email_meeting, na.rm = TRUE),
    sum_email_enron_event = sum(email_enron_event, na.rm = TRUE)) %>%
  ungroup() %>%
  #keep one line per month and subject
  distinct(year_month, subject, sum_subject_meeting, sum_subject_business_process, sum_subject_core_business, sum_subject_enron_event,
           sum_email_business_process, sum_email_core_business, sum_email_meeting, sum_email_enron_event)

#graph of the emails mentioning Enron that could discuss the Enron events or business process
extern_email_graph %>% select(-subject) %>%
  #change the orientation of the data set
  pivot_longer(
    cols = 2:9,
    names_to = "topics",
    values_to = "value") %>%
  #line plot, one line per topic
  ggplot(aes(year_month, value, color = topics)) +
  geom_line(linewidth = 1) +
  #label, axis, and legend
  labs(color = "Email topics",
       title = "Email subject and content about Enron events",
       subtitle = "Email exchanges about Enron between people without an Enron email address",
       x = "Year and month",
       y = "Number of emails") +
  #display the year and month every 3 months for better readability
  scale_x_date(date_labels = "%Y-%m", date_breaks = "3 months") +
  scale_color_manual(#to get only the customization for the email categories
    values = topic_colors,
    labels = topic_label)

The graph above shows that some emails between people without an Enron email address discuss the Enron events, especially the business process; fewer discuss the core business of the company. Those emails were mostly sent between October 2001 and January 2002, which is the period of the SEC investigation into the Enron fraud. Inside the email content we do not find the keywords related to those events.
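
One possible reason for finding few keywords in the email content, given here as an assumption rather than a verified cause, is that the content column (reference) is missing for many rows. The sketch below shows how a missing value travels through the flagging step and why na.rm = TRUE is needed in the sums. The pattern toy_topic_enron_event and the second content string are hypothetical; the real topic patterns are defined earlier in the analysis.

```r
library(stringr)
library(dplyr)

# Hypothetical keyword pattern, for illustration only
toy_topic_enron_event <- regex("bankruptcy|investigation", ignore_case = TRUE)

subject   <- c("NYTimes.com Article: Enron Paid Out Retention Bonuses Before Bankruptcy Filing",
               "EnronOnline Documents")
reference <- c(NA, "notes on the SEC investigation")

# A subject is always present, so the flag is 0 or 1
subject_flag <- if_else(str_detect(subject, toy_topic_enron_event), 1, 0)

# A missing content gives NA, not 0
email_flag <- if_else(str_detect(reference, toy_topic_enron_event), 1, 0)

# Without na.rm = TRUE this sum would be NA
email_total <- sum(email_flag, na.rm = TRUE)
```

In this toy case the first email matches on its subject but cannot match on its content, because the content is missing.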

#isolate the subjects about Enron and its events
Enron_subject <- extern_email_graph %>%
  filter(str_detect(subject, "enron|Enron")) %>%
  filter((sum_subject_meeting != 0) | (sum_subject_business_process != 0) | (sum_subject_core_business != 0) | (sum_subject_enron_event != 0)) %>%
  distinct(year_month, subject, .keep_all = TRUE)

#keep the exchanges that involve at least one Enron email address
no_extern <- df_message_status %>%
  select(date, sender, recipient, subject, reference) %>%
  #flag the senders and recipients that have an Enron email address
  mutate(count_sender = if_else(str_detect(sender, "@enron"), 1, 0),
         count_recipient = if_else(str_detect(recipient, "@enron"), 1, 0)) %>%
  #for each date and subject, count the senders and recipients with an Enron email address
  group_by(date, subject) %>%
  mutate(sum_sender = sum(count_sender),
         sum_recipient = sum(count_recipient)) %>%
  ungroup() %>%
  #keep the exchanges with an Enron address on at least one side
  filter((sum_sender != 0) | (sum_recipient != 0)) %>%
  select(-c(count_sender, count_recipient, sum_sender, sum_recipient)) %>%
  #convert the sender and recipient columns to factors
  transform(sender = as.factor(sender),
            recipient = as.factor(recipient))

#inner join with the external subjects to see whether they also appear in exchanges involving Enron employees
print(verify <- inner_join(no_extern, Enron_subject, by = "subject"))
##         date                  sender                recipient
## 1 2002-01-04 david.forster@enron.com louise.kitchen@enron.com
## 2 2001-12-07        louise@enron.com         louise@enron.com
##                                                                            subject
## 1                                                            EnronOnline Documents
## 2 NYTimes.com Article: Enron Paid Out  Retention  Bonuses Before Bankruptcy Filing
##   reference year_month sum_subject_meeting sum_subject_business_process
## 1      <NA> 2001-12-01                   0                            0
## 2      <NA> 2001-12-01                   0                            0
##   sum_subject_core_business sum_subject_enron_event sum_email_business_process
## 1                         1                       1                          1
## 2                         1                       1                          1
##   sum_email_core_business sum_email_meeting sum_email_enron_event
## 1                       0                 1                     1
## 2                       0                 1                     1

We can see that 2 subjects are found both in the external exchanges and in the data set restricted to exchanges involving at least one Enron email address. Those emails were sent in December 2001 and January 2002: one is from the CEO David Forster and is about EnronOnline documents, the second is from a Louise at Enron and relates to an article about the Enron bankruptcy. We can think that those emails also involved people external to the company, who spread this information outside it.
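
The matching step relies on an inner join by subject, which keeps only the subjects present in both tables. A minimal sketch on toy data follows; the tibbles internal and external_subjects, the address someone@enron.com and the subject "Weekly report" are invented for illustration.

```r
library(dplyr)
library(tibble)

# Toy internal exchanges (at least one Enron address involved)
internal <- tibble(
  subject = c("EnronOnline Documents", "Weekly report"),
  sender  = c("david.forster@enron.com", "someone@enron.com")
)

# Toy subjects seen in the external exchanges
external_subjects <- tibble(
  subject = c("EnronOnline Documents", "Quantitative Finance Update")
)

# Only the subject present on both sides survives the join
shared <- inner_join(internal, external_subjects, by = "subject")

shared
```

Because the match is on the exact subject text, two unrelated emails with the same generic subject would also be paired, so each match still needs to be read before drawing a conclusion.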

To conclude, the Enron company is composed of different statuses that seem to have different degrees of involvement in the fiscal fraud. The people at the head of the company, as well as the traders and the lawyers, seem to be actors in the fraud. The other statuses seem to be aware of it, perhaps without playing a major role in it. Looking at the people known to be involved in the Enron fiscal fraud, we do not identify many emails sent or received about it, about the management of the bankruptcy, or about the SEC investigation. We can think that they used other ways to communicate; at that time, they probably communicated more by phone than by email.

A brief investigation of potential external exchanges shows that other companies in the US discussed the Enron events and that 2 emails are directly associated with internal company exchanges. It would be interesting to investigate the email content further with a more exhaustive data set. This would improve our knowledge of the Enron events as well as of the involvement of the different statuses in them.